Abstract

The  utilization  of  B-tree  nodes  determines  the  number  of  levels  in  the  B-tree  and  hence  its 
performance.  Yao  showed  that  the  utilization  of  B-tree  nodes  under  pure  inserts  was  69%.  We  derive 
analytically  and  verify  by  simulation  the  utilization  of  B-tree  nodes  constructed  from  a  mixture  of 
insert  and  delete  operations.  Assuming  that  nodes  only  merge  when  they  are  empty  we  show  that 
the  utilization  is  39%  when  the  number  of  inserts  is  the  same  as  the  number  of  deletes.  However,  if 
there  are  just  5%  more  inserts  than  deletes,  then  the  utilization  is  over  62%.  We  also  calculate  the 
probability  of  splitting  and  merging.  We  derive  a  simple  rule-of-thumb  that  accurately  calculates 
the  probability  of  splitting.  We  also  model  B-trees  that  merge  half-empty  nodes.  The  utilization  of 
merge-at-half B-trees is slightly larger than the utilization of merge-at-empty B-trees (within 10% if
there are at least 5% more inserts than deletes), but the restructuring rate is much higher. We
present  two  models  for  computing  B-tree  utilization,  the  more  accurate  of  which  remembers  items 
inserted  and  then  deleted  in  a  node.  After  analyzing  the  leaf  level  of  a  random  B-tree,  we  analyze  the 
upper  levels,  and  show  that  they  have  the  structure  of  the  leaf  level  of  a  pure-insert  tree,  independent 
of  the  percentage  of  operations  that  are  deletes  (if  the  percentage  of  deletes  is  less  than  50%). 

1      Introduction 

B-trees are commonly used in very large database systems to provide indices into the database. The
space  that  the  B-tree  index  consumes  can  be  considerable.  In  order  to  allocate  an  appropriate  amount 
of  space  for  the  B-tree,  an  estimate  of  the  B-tree  utilization  is  needed. 

In  this  paper,  we  will  derive  and  solve  equations  that  describe  the  equilibrium  structure  of  a  B-tree 
under a parameterized mix of insert and delete operations, where the deletes can come from modify operations.
Every item in the B-tree is equally likely to be deleted on a delete operation. Every permutation
of the operands of the insert operations is equally likely. The method is based on Yao's [Yao 78], except
that  we  will  consider  the  solution  to  be  the  equilibrium  point  of  a  set  of  difference  equations.  We  will 
show  how  the  probability  that  a  node  receives  an  insert  can  be  calculated  by  keeping  track  of  the  number 
of  ghosts,  or  items  that  have  been  deleted,  that  are  in  the  B-tree. 

In addition to the utilization, we will calculate the probability of splitting or merging a node on the
bottom  level.  After  analyzing  the  leaf  level  of  a  random  B-tree,  we  will  analyze  the  upper  levels  of  a 
B-tree.  We  will  use  this  analysis  to  predict  the  height  and  number  of  children  of  the  root. 

We  will  model  two  types  of  B-trees  in  this  paper.  The  first  type  allows  B-tree  nodes  to  become 
arbitrarily  small  (i.e.  contain  as  few  as  one  entry).  When  a  node  becomes  empty,  the  space  is  freed  and 
the  delete  is  propagated  upwards.  We  call  this  type  of  B-tree  a  merge-at-empty  B-tree.  Merge-at-empty 
B-trees are commonly used in database management systems. The second type of B-tree that we will
model merges a half-empty node with its neighbor on a delete (a merge-at-half B-tree). We will compare
the  two  types  of  B-trees  in  this  paper  and  conclude  that  database  management  systems  have  made  the 
right  decision. 

1.1  Simulation  Model 

We  wrote  a  B-tree  simulator  to  compare  against  our  analytical  results.  The  simulation  builds  a  B-tree 
out  of  a  sequence  of  inserts  and  deletes,  then  applies  a  long  sequence  of  parameterized  inserts  and  deletes. 
Every  item  in  the  B-tree  is  equally  likely  to  be  chosen  as  the  operand  of  a  delete  operation.  The  operands 
of  the  insert  operations  are  chosen  uniformly  randomly  from  a  large  key-space. 
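The workload described above can be sketched in a few lines. The sketch below is not the authors' simulator: it models only the leaf level, draws keys as uniform random floats, and builds the initial tree with pure inserts; the function name `simulate_leaf_level` and its parameters are hypothetical. It uses the merge-at-empty rules described in the Introduction (a full leaf of order 2p - 1 splits into two leaves of order p, and an emptied leaf is freed).

```python
import bisect
import random


def simulate_leaf_level(p, n_build, n_ops, q, seed=0):
    """Leaf-level sketch of a merge-at-empty B-tree; returns the utilization.

    Assumptions (not from the paper): only the leaf level is modelled, keys
    are uniform random floats, and the tree is built with pure inserts.
    """
    rng = random.Random(seed)
    capacity = 2 * p - 1
    lows = [float("-inf")]  # lower bound of each leaf's key range
    leaves = [[]]           # sorted keys held by each leaf
    items = []              # every live key, so deletes can pick uniformly

    def insert():
        key = rng.random()
        idx = bisect.bisect_right(lows, key) - 1
        leaf = leaves[idx]
        bisect.insort(leaf, key)
        items.append(key)
        if len(leaf) > capacity:
            # An insert into a full (order 2p-1) leaf yields two order-p leaves.
            right = leaf[p:]
            del leaf[p:]
            leaves.insert(idx + 1, right)
            lows.insert(idx + 1, right[0])

    def delete():
        # Every live item is equally likely to be the operand of the delete.
        j = rng.randrange(len(items))
        items[j], items[-1] = items[-1], items[j]
        key = items.pop()
        idx = bisect.bisect_right(lows, key) - 1
        leaves[idx].remove(key)
        if not leaves[idx] and len(leaves) > 1:
            # Merge-at-empty: free the leaf; the right neighbour absorbs its
            # key range (the left neighbour does if the leaf was rightmost).
            del leaves[idx]
            del lows[idx + 1 if idx + 1 < len(lows) else idx]

    for _ in range(n_build):
        insert()
    for _ in range(n_ops):
        if items and rng.random() < q:
            delete()
        else:
            insert()
    return len(items) / (capacity * len(leaves))


if __name__ == "__main__":
    print("pure inserts:", simulate_leaf_level(p=10, n_build=20000, n_ops=0, q=0.0))
    print("q = 0.5     :", simulate_leaf_level(p=10, n_build=3000, n_ops=60000, q=0.5))
```

Keeping a flat `items` list next to the leaves is a convenience of this sketch: it makes the uniform choice of a delete operand a constant-time operation.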

1.2  Previous  Work 

Yao derived an estimate of 69% utilization for pure insert operations in [Yao 78]. A paper that independently
derived the same result by nearly the same methods is [Nakamura 78]. Yao's method of
analysis is generalized to fringe analysis in [Eisenbarth 82]. Fringe analysis is the analysis of the leaves
of a tree data structure. Eisenbarth et al. showed how to solve the matrix recurrence equations that
result from a pure-insert model. B-tree variants were analyzed using fringe analysis in [Baeza-Yates 89a]
and [Baeza-Yates 89b]. Another B-tree variant that has high space utilization was analyzed in [Kuspert
83]. An alternative approach to estimating the utilization of a B-tree appeared in [Leung 84] and was
improved on in [Gupta 86]. Yao's method was elaborated upon to obtain the probability of splitting in
[Wright 85]. The number of children of the root of a B-tree was approximated in [Driscoll 87].

None of the above papers addressed the problem of deletes in the instruction mix. Mizoguchi [Mizoguchi 79]
proposed an approximate model for merge-at-empty B-trees in order to analyze utilization.
The range of his analysis was from pure inserts to 33% deletes/77% inserts. He also predicted the utilization
at 50% deletes/50% inserts (pure modifies), but his solution was pessimistic. Our linear model
is similar to Mizoguchi's. (We show that the linear model is not a good approximation as the percentage
of deletes approaches 50%.)

Quitzow  and  Klopprogge  [Quitzow  80]  proposed  a  differential  equation  model  to  predict  the  utilization 
for both merge-at-empty and merge-at-half B-trees. They analyzed the case of pure inserts (where they
are  consistent  with  Yao)  and  the  case  of  pure  modifies.  Our  linear  model  is  similar  to  theirs. 

Wright and Langenhop [Langenhop 85] also have a model for merge-at-half B-trees with pure modify
operations.  Their  predictions  are  different  from  those  in  [Quitzow  80]  (and  ours).  It  turns  out  that 
simulation  results  fall  roughly  in  between. 

Hsu  and  Zhang  [Zhang  87]  analyzed  restructuring  rates  (i.e.  rates  of  splits  and  merges)  as  a  function 
of  the  merging  point  using  a  simple  mathematical  model.  Our  qualitative  conclusions  are  similar  to 
theirs,  but  our  quantitative  predictions  are  significantly  different. 

So,  the  main  contributions  of  the  present  work  are  to  show  the  limits  of  the  linear  model  (as  compared 
with  the  ghost  model),  an  analysis  of  the  utilization  as  the  number  of  deletes  approaches  the  number  of 
inserts  (indicating  the  sharp  knee  in  the  utilization  curve),  a  set  of  predictions  of  restructuring  rates  as 
characterized  by  a  simple  rule  of  thumb,  and  a  comparison  of  the  analytical  models  with  a  simulation  of 
an actual B-tree.

2     Merge-at-Empty B-trees

2.1      Pure  Inserts 

In  this  section,  we  review  Yao's  analysis  for  pure  inserts,  because  our  analysis  is  a  generalization  of  Yao's. 
We  begin  with  the  terminology  and  methodology. 


A B-tree is a balanced tree in which the distance between the root and any leaf is the same ([Knuth
73]). A B-tree is of parameter p if each non-leaf node has between 1 and 2p - 1 children. The children of
an interior node are accounted for by entries in a node; there is one entry per child. The entries contain
key and pointer information. The leaf nodes contain the items in the B-tree. A B-tree node is of order
k if it has k entries.

Define:

$N$: number of inserts (assumed successful).

$X_i(N)$: number of order $i$ nodes in a particular B-tree of size $N$ (i.e. resulting from $N$ inserts).

$A_i(N)$: expected number of order $i$ nodes in a random B-tree of size $N$.

$a_i(N)$: $= A_i(N)/N$.

$U$: expected space utilization of a random B-tree.

$P_s$: probability of splitting on an insert.

Let $t[N]$ be a particular random B-tree of parameter $p$ with $N$ items. Then $X_i^{t[N]}$ is the number of
leaf-level nodes in $t[N]$ of order $i$. Let $\Pr[t[N]]$ be the probability that a random B-tree of parameter $p$
with $N$ items is $t[N]$. Define $A_i(N)$ to be the expected number of leaf-level nodes of order $i$ in a tree
with $N$ items. Then

\[ A_i(N) = \sum_{t[N]} X_i^{t[N]} \Pr[t[N]] \]

A B-tree of parameter $p$ with $N$ items is of class $(X_1, X_2, \ldots, X_{2p-1}|N)$ if it has $X_1$ nodes of order 1 on the
leaf level, $X_2$ nodes of order 2 on the leaf level, etc. Let $\Pr[(X_1, \ldots, X_{2p-1}|N)]$ be the probability that
a random B-tree of parameter $p$ with $N$ items is of class $(X_1, \ldots, X_{2p-1}|N)$. Another way to calculate
$A_i$ is:

\[ A_i(N) = \sum_{(X_1, \ldots, X_i, \ldots, X_{2p-1}|N)} X_i \Pr[(X_1, \ldots, X_i, \ldots, X_{2p-1}|N)] \]

Suppose that we have a B-tree of class $(X_p, \ldots, X_{2p-1}|N)$ and an insertion occurs. If the insertion
goes to an order $i$ node, $p \le i < 2p-1$, then the result is a B-tree of class
$(X_p, \ldots, X_i - 1, X_{i+1} + 1, \ldots, X_{2p-1}|N+1)$. If the insertion goes to an order $2p-1$ leaf node, then the result is a B-tree of
class $(X_p + 2, \ldots, X_{2p-1} - 1|N+1)$, because splitting an order $2p-1$ node results in two order $p$ nodes.
Next, in order to specify completely the way that a B-tree evolves through inserts, we must specify the
probability that an insert is directed to an order $i$ node.

To do that, specify an input probability distribution. We will use a uniform distribution in the
following sense: any of the $N!$ permutations of an $N$ item insert sequence are equally likely. If $N$ items
have already been inserted, then there are $N+1$ positions to which the next insert can be directed
($N-1$ intervals and 2 end positions), each of which is equally likely. The probability that a given node
of order $i$ receives an insert is thus $i/(N+1)$. In addition, if the node is, say, the rightmost one, then the
node has an additional probability of $1/(N+1)$ of receiving the insert. Therefore, the probability that
an insert is directed to an order $i$ node in a class $(X_p, \ldots, X_i, \ldots, X_{2p-1}|N)$ B-tree is

\[ \frac{iX_i}{N+1} + \frac{1}{N+1}\Pr[\text{the rightmost item is in an order } i \text{ node}]
 = \frac{iX_i}{N+1} + \frac{1}{N+1}\cdot\frac{iX_i}{N}
 = \frac{iX_i}{N} \]

(Note that $X_1 = X_2 = \cdots = X_{p-1} = 0$.)

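Using the above probabilities, we can develop a system of recurrence equations for the $A_i(N)$:

\[ A_p(N+1) = \sum_{(X_p, \ldots, X_{2p-1}|N)} \Pr[(X_p, \ldots, X_{2p-1}|N)] \left\{ \frac{pX_p}{N}(X_p - 1) + \frac{(2p-1)X_{2p-1}}{N}(X_p + 2) + \left(1 - \frac{pX_p}{N} - \frac{(2p-1)X_{2p-1}}{N}\right)X_p \right\} \]

\[ A_i(N+1) = \sum_{(X_p, \ldots, X_{2p-1}|N)} \Pr[(X_p, \ldots, X_{2p-1}|N)] \left\{ \frac{iX_i}{N}(X_i - 1) + \frac{(i-1)X_{i-1}}{N}(X_i + 1) + \left(1 - \frac{iX_i}{N} - \frac{(i-1)X_{i-1}}{N}\right)X_i \right\}, \qquad i \ne p \]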

These  reduce  to 

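\[ A_p(N+1) = A_p(N)\left(1 - \frac{p}{N}\right) + A_{2p-1}(N)\,\frac{2(2p-1)}{N} \]

\[ A_i(N+1) = A_i(N)\left(1 - \frac{i}{N}\right) + A_{i-1}(N)\,\frac{i-1}{N}, \qquad i \ne p \]

Making the substitution $a_i(N) = A_i(N)/N$:

\[ (N+1)\,a_p(N+1) = a_p(N)(N - p) + 2(2p-1)\,a_{2p-1}(N) \]

\[ (N+1)\,a_i(N+1) = a_i(N)(N - i) + (i-1)\,a_{i-1}(N), \qquad i \ne p \]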

The $a_i(N)$ describe the proportion of nodes on the leaf level that are of order $i$. Up to this point,
we have been following Yao's treatment of the pure-insert problem. Next, however, we will deviate
from Yao's methods and transform the equations into a set involving terms of the form
$\Delta a_i(N+1) = a_i(N+1) - a_i(N)$:

\[ (N+1)\,\Delta a_p(N+1) = -a_p(N)(p+1) + 2(2p-1)\,a_{2p-1}(N) \]

\[ (N+1)\,\Delta a_i(N+1) = -a_i(N)(i+1) + (i-1)\,a_{i-1}(N), \qquad i \ne p \]

We want to solve for the equilibrium point of this set of simultaneous difference equations, that is,
the point where $\Delta a_i(N+1) = 0$. At the equilibrium point, the $a_i(N)$ will not change with $N$, so we can
remove the dependence on $N$ to obtain the set of equations:

\[ 0 = -(p+1)\,a_p + 2(2p-1)\,a_{2p-1} \]

\[ 0 = -(i+1)\,a_i + (i-1)\,a_{i-1}, \qquad i \ne p \]
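As a numerical illustration of the equilibrium argument, the recurrence for the expected node counts can simply be iterated and the equilibrium equations evaluated on the result. This is a minimal sketch, assuming the tree starts as a single leaf of order p holding N = p items (a starting state chosen here for convenience, not taken from the paper); the function names are hypothetical.

```python
def iterate_pure_insert(p, n_max):
    """Iterate the pure-insert recurrence for A_i(N); return {i: a_i(n_max)}."""
    orders = range(p, 2 * p)
    A = {i: 0.0 for i in orders}
    A[p] = 1.0  # assumed start: one leaf of order p holding N = p items
    for n in range(p, n_max):
        new = {}
        for i in orders:
            if i == p:
                # Splits of full nodes create two order-p nodes each.
                gain = A[2 * p - 1] * 2 * (2 * p - 1) / n
            else:
                gain = A[i - 1] * (i - 1) / n
            new[i] = A[i] * (1 - i / n) + gain
        A = new
    return {i: A[i] / n_max for i in orders}


def equilibrium_residuals(p, a):
    """Right-hand sides of the equilibrium equations; zero at the fixed point."""
    res = {p: -(p + 1) * a[p] + 2 * (2 * p - 1) * a[2 * p - 1]}
    for i in range(p + 1, 2 * p):
        res[i] = -(i + 1) * a[i] + (i - 1) * a[i - 1]
    return res


if __name__ == "__main__":
    a = iterate_pure_insert(5, 5000)
    print(a)
    print(equilibrium_residuals(5, a))
```

For small p the residuals shrink quickly as N grows; for large p the approach to equilibrium is slower, so a larger `n_max` would be needed.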

In order to prove the existence and uniqueness of the equilibrium point, make the substitution $P_i = i a_i$.
Then we have the equations

\[ 0 = -(p+1)\,P_p + 2p\,P_{2p-1} \]

\[ 0 = -(i+1)\,P_i + i\,P_{i-1}, \qquad i \ne p \]

The above equations sum to 0, so it is easy to see that the above $p$ equations are of rank $p-1$.
Therefore, the matrix of coefficients is singular, so that the set of equations has a non-trivial solution.
That the matrix is of rank $p-1$ also means that it has only one eigenvalue of 0, so the equilibrium point
is unique. Since the equilibrium point of the $P_i$ exists and is unique, the equilibrium point of the $a_i$
exists and is unique also. The solution that we are looking for is subject to

\[ \sum_{i=p}^{2p-1} i a_i = 1. \]


The stability of the solution follows from the fact that non-equilibrium states will tend towards
equilibrium states, as can be seen by considering the system to be a system of flows. Think of each
$a_i$ as being a cell. The flow out of a cell is directly proportional to the value of the cell, and the flow in
is directly proportional to the value of the neighbors. Suppose that a cell has less than its equilibrium
value. Because $\sum_{i=p}^{2p-1} i a_i = 1$, some other cells must have more than their equilibrium value. Therefore
the flow out of the cell will be less than at equilibrium, and the flow in will be larger than at equilibrium.
The result is a net flow in, and the value of the cell will grow. The converse holds if a cell has greater
than its equilibrium value. This argument generalizes to a sequence of cells with less or more than their
equilibrium values.

We  have  a  system  of  simultaneous  linear  equations  that  we  can  solve  numerically.  However,  this 
system  is  simple  enough  that  we  can  give  an  algebraic  solution. 

Theorem 1 ([Yao 78])

\[ a_i = \frac{1}{i(i+1)}\,\bigl(H(2p) - H(p)\bigr)^{-1}, \qquad p \le i \le 2p-1 \]

where $H(p)$ is the harmonic function $H(p) = \sum_{i=1}^{p} 1/i \approx \ln p$.
The following lemma has appeared in [Wright 85]:

Lemma 1    The probability of inserting at an order $i$ node on the leaf level when all of the operations on
the B-tree are inserts is $i a_i$.

Corollary 1    The probability of inserting at a full node on the leaf level when all of the operations on
the B-tree are inserts is

\[ P_s = (2p-1)\,a_{2p-1} \]

The value $(2p-1)a_{2p-1}$ is approximately $1/(2p \ln 2)$ when $p$ is large (where the maximum node size
is $2p-1$). This contradicts the false intuition that the probability of splitting at the leaf level would be
$1/p$, as $p$ insertions into a half-empty node cause a split. The reason is that the B-tree is growing; the
factor of $2 \ln 2$ in the denominator is the result.
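The closed form and the large-p approximation are easy to compare numerically. The sketch below evaluates Theorem 1 and Corollary 1 directly; the function names are hypothetical and chosen only for this illustration.

```python
import math


def harmonic(n):
    """H(n) = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / k for k in range(1, n + 1))


def yao_distribution(p):
    """Closed-form a_i of Theorem 1 for p <= i <= 2p - 1."""
    norm = harmonic(2 * p) - harmonic(p)
    return {i: 1.0 / (i * (i + 1) * norm) for i in range(p, 2 * p)}


def split_probability(p):
    """Corollary 1: P_s = (2p - 1) * a_{2p-1}."""
    return (2 * p - 1) * yao_distribution(p)[2 * p - 1]


if __name__ == "__main__":
    # Columns: p, exact P_s, the 1/(2p ln 2) approximation, the naive 1/p guess.
    for p in (5, 10, 20, 100):
        print(p, split_probability(p), 1.0 / (2 * p * math.log(2)), 1.0 / p)
```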

The space utilization, $U$, of a B-tree is the portion of the space taken up by the B-tree that stores
information. The following lemma, which has appeared in [Mizoguchi 79] and [Quitzow 80], will also be
useful:

Lemma 2   If $U$ is the utilization of the B-tree, then

\[ U = \frac{1}{(2p-1)\sum_i a_i} \]
Corollary 2    The utilization of a B-tree with pure-insert operations is

\[ U = \frac{2p\,\bigl(H(2p) - H(p)\bigr)}{2p-1} \approx 69\% \]

where $H(p)$ is the harmonic function defined previously.
Proof: Sum the $a_i$ and apply the lemma. □
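The one-line proof can be expanded as follows. This is a worked version of the sum that uses only Theorem 1 and Lemma 2; the telescoping step is the only ingredient not spelled out in the text.

```latex
\[
\sum_{i=p}^{2p-1} a_i
  = \frac{1}{H(2p)-H(p)} \sum_{i=p}^{2p-1} \left( \frac{1}{i} - \frac{1}{i+1} \right)
  = \frac{1}{H(2p)-H(p)} \left( \frac{1}{p} - \frac{1}{2p} \right)
  = \frac{1}{2p\,\bigl(H(2p)-H(p)\bigr)}
\]
\[
U = \frac{1}{(2p-1)\sum_i a_i}
  = \frac{2p\,\bigl(H(2p)-H(p)\bigr)}{2p-1}
  \;\longrightarrow\; \ln 2 \approx 0.693
  \quad (p \to \infty)
\]
```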

2.2      Pure-Modify  Operations 

The next instruction mix that we will examine is pure-modify. Every operation is assumed to be a modify
operation, which is modeled as a delete followed by an insert. That is, a modify deletes one key then
inserts one key. We assume that the B-tree is initially built from some sequence of operations (which
may contain some deletes), then has a long sequence of modify operations applied to it. Both the insert
and delete of the modify operation are assumed to be successful.


2.2.1      Ghost  Model 

Both [Mizoguchi 79] and [Quitzow 80] have analyzed the case of pure-modify operations on merge-at-empty
B-trees, but they assume that the probability that a node receives an insert is proportional to
its size. The problem with this model is that after a large number of modify operations, the number of
items in a node bears little relation to the number of insert operations that were performed on the node,
yet the probability that a node will receive an insert is related to the number of inserts performed on the
node. We will define a ghost to be a data item that was inserted, then later deleted. Though the ghost
no longer exists in the B-tree, it still affects the distribution of insert operations. We must therefore keep
track of the number of ghosts at each node.
Define:

$N$: the number of items in the B-tree.

$M$: number of modify operations.

$A_i(M)$: expected number of order $i$ nodes after modify operation $M$.

$B_i(M)$: expected number of order $i$ nodes after the delete portion of modify operation $M$.

$C_i(M)$: the average number of ghosts contained in all of the order $i$ nodes after $M$ modify operations.

$C'_i(M)$: the average number of ghosts contained in an order $i$ node after $M$ modify operations. $C'_i(M) = C_i(M)/A_i$.

$c_i(M)$: the proportion of ghosts that are in order $i$ nodes after $M$ modifies. $c_i(M) = C_i(M)/M$.

$D_i(M)$: the average number of ghosts contained in all of the order $i$ nodes after the delete portion of
modify operation $M$.

$D'_i(M)$: the average number of ghosts contained in an order $i$ node after the delete portion of modify
operation $M$. $D'_i(M) = D_i(M)/A_i$.

$d_i(M)$: the proportion of ghosts that are in order $i$ nodes after the delete portion of modify operation
$M$. $d_i(M) = D_i(M)/M$.

$v$: $= \sum_i a_i$.

$P_m$: probability of removing (merging) a node because of a delete.

Let us first develop the equations that describe the $a_i$. After the delete portion of the modify:

\[ B_i(M+1) = \left(1 - \frac{i}{N}\right)A_i(M) + \frac{i+1}{N}A_{i+1}(M), \qquad i \ne 2p-1 \]

\[ B_{2p-1}(M+1) = \left(1 - \frac{2p-1}{N}\right)A_{2p-1}(M) \]

Next we turn to the insert. The number of nodes of order $i$ after an insert is the number of nodes
before the insert, minus the expected loss, plus the expected gain:

\[ A_i(M+1) = (\text{previous}) - (\Pr[\text{order } i \text{ node hit}])(\text{loss}) + (\Pr[\text{order } i-1 \text{ node hit}])(\text{gain}) \]

Here, the previous value is $B_i(M+1)$ and the gain or loss is 1. After a large number of modifies, there
will be many more ghosts than data items in the tree. Therefore, the probability that a node of order $i$
receives the insert approaches $c_i(M)$.

\[ A_i(M+1) = B_i(M+1) - c_i(M)\cdot 1 + c_{i-1}(M)\cdot 1 \]

Substitute in the value of $B_i(M+1)$:

\[ A_i(M+1) = \left(1 - \frac{i}{N}\right)A_i(M) + \frac{i+1}{N}A_{i+1}(M) - c_i(M) + c_{i-1}(M) \]

\[ a_i(M+1) = \left(1 - \frac{i}{N}\right)a_i(M) + \frac{i+1}{N}a_{i+1}(M) - \frac{c_i(M)}{N} + \frac{c_{i-1}(M)}{N} \]

Solving for $\Delta a_i(M+1)$,

\[ N\,\Delta a_i(M+1) = -i\,a_i(M) + (i+1)\,a_{i+1}(M) - c_i(M) + c_{i-1}(M) \]

The equations for the $a_i(M)$ depend on values for the $c_i(M)$. We next develop equations that describe
the $c_i(M)$. Again model a modify as a delete followed by an insert, and calculate the new number of
ghosts as the previous number minus the expected loss plus the expected gain. Start with the delete:

\[ D_i(M+1) = (\text{previous}) - (\Pr[\text{order } i \text{ node hit}])(\text{loss}) + (\Pr[\text{other node hit}])(\text{gain}) \]

The previous value is $C_i(M)$. The probability that a delete is directed to an order $i$ node is $iA_i(M)/N$,
the proportion of data items contained in order $i$ nodes. The loss is $C'_i(M)$, the expected number of
ghosts in an order $i$ node. $D_i$ will gain if a delete was directed to an order $i+1$ node or to an order 1
node. The gain from an order $i+1$ node is $C'_{i+1}(M) + 1$, because a new ghost is created. If a delete is
directed to an order 1 node, then the node is merged. The node immediately to the right will receive the
merged node's key range, and thus will receive the merged node's ghosts. The probability that an order $i$
node receives the deleted node's ghosts is $a_i(M)/\sum_{j=1}^{2p-1} a_j(M)$. Therefore:

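\[ D_i(M+1) = C_i(M) - \frac{iA_i(M)}{N}\,C'_i(M) + \frac{(i+1)A_{i+1}(M)}{N}\bigl(C'_{i+1}(M) + 1\bigr) + \frac{A_1(M)}{N}\cdot\frac{a_i(M)}{\sum_j a_j(M)}\bigl(C'_1(M) + 1\bigr) \]

\[ \phantom{D_i(M+1)} = C_i(M) - \frac{iC_i(M)}{N} + \frac{(i+1)C_{i+1}(M)}{N} + \frac{a_i(M)C_1(M)}{N\sum_j a_j(M)} + (i+1)a_{i+1}(M) + \frac{a_1(M)a_i(M)}{\sum_j a_j(M)} \]

\[ \frac{D_i(M+1)}{M} = c_i(M) - \frac{ic_i(M)}{N} + \frac{(i+1)c_{i+1}(M)}{N} + \frac{a_i(M)c_1(M)}{N\sum_j a_j(M)} + \frac{(i+1)a_{i+1}(M)}{M} + \frac{a_1(M)a_i(M)}{M\sum_j a_j(M)} \]

The terms $((i+1)a_{i+1}(M))/M$ and $(a_1(M)a_i(M))/(M\sum_{j=1}^{2p-1} a_j(M))$ become insignificant as $M$
becomes large, so we may drop them. Since $D_i(M+1)/M = D_i(M+1)/(M+1) + D_i(M+1)/[M(M+1)]$,
we may assume that $D_i(M+1)/M \approx d_i(M+1)$, so that:

\[ d_i(M+1) = c_i(M) - \frac{ic_i(M)}{N} + \frac{(i+1)c_{i+1}(M)}{N} + \frac{a_i(M)c_1(M)}{N\sum_j a_j(M)}, \qquad i = 1, \ldots, 2p-2 \]

\[ d_i(M+1) = c_i(M) - \frac{ic_i(M)}{N} + \frac{a_i(M)c_1(M)}{N\sum_j a_j(M)}, \qquad i = 2p-1 \]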

The $C_i(M+1)$ evolve from the $D_i(M+1)$ in the following way:

\[ C_i(M+1) = (\text{previous}) - (\Pr[\text{order } i \text{ node hit}])(\text{loss}) + (\Pr[\text{order } i-1 \text{ node hit}])(\text{gain}) \]

The previous value is $D_i(M+1)$. The probability that an insert is directed to an order $i$ node is $d_i(M+1)$,
and the gain or loss is $D'_i(M+1)$. Therefore:

\[ C_i(M+1) = D_i(M+1) - d_i(M+1)\,\frac{D_i(M+1)}{A_i(M)} + d_{i-1}(M+1)\,\frac{D_{i-1}(M+1)}{A_{i-1}(M)} \]

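\[ c_i(M+1) = c_i(M) - \frac{ic_i(M)}{N} + \frac{(i+1)c_{i+1}(M)}{N} + \frac{a_i(M)c_1(M)}{N\sum_j a_j(M)} - \frac{d_i(M+1)^2}{N a_i(M)} + \frac{d_{i-1}(M+1)^2}{N a_{i-1}(M)} \]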

At the equilibrium point, the $a_i(M)$, $c_i(M)$ and the $d_i(M)$ are constant, so remove their dependence
on $M$. Including the exceptional cases at $i = 1, p, 2p-1$, the equations that determine the equilibrium
point of the $a_i$ are:

\[ 0 = -i a_i + (i+1)a_{i+1} - c_i, \qquad i = 1 \qquad (1.1) \]

\[ 0 = -i a_i + (i+1)a_{i+1} - c_i + c_{i-1} + 2c_{2p-1}, \qquad i = p \qquad (1.2) \]

\[ 0 = -i a_i - c_i + c_{i-1}, \qquad i = 2p-1 \qquad (1.3) \]

\[ 0 = -i a_i + (i+1)a_{i+1} - c_i + c_{i-1}, \qquad \text{otherwise} \qquad (1.4) \]

In addition, $\sum_{i=1}^{2p-1} i a_i = 1$. If we solve for the point at which $N\Delta c_i(M+1) = 0$, the following equations
describe the equilibrium point of the $c_i$:

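\[ 0 = -i c_i + (i+1)c_{i+1} + \frac{a_i c_1}{\sum_{j=1}^{2p-1} a_j} - \frac{c_i^2}{a_i}, \qquad i = 1 \]

\[ 0 = -i c_i + (i+1)c_{i+1} + \frac{a_i c_1}{\sum_{j=1}^{2p-1} a_j} - \frac{c_i^2}{a_i} + \frac{c_{i-1}^2}{a_{i-1}} + \frac{c_{2p-1}^2}{a_{2p-1}}, \qquad i = p \]

\[ 0 = -i c_i + \frac{a_i c_1}{\sum_{j=1}^{2p-1} a_j} - \frac{c_i^2}{a_i} + \frac{c_{i-1}^2}{a_{i-1}}, \qquad i = 2p-1 \]

\[ 0 = -i c_i + (i+1)c_{i+1} + \frac{a_i c_1}{\sum_{j=1}^{2p-1} a_j} - \frac{c_i^2}{a_i} + \frac{c_{i-1}^2}{a_{i-1}}, \qquad \text{otherwise} \]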

Additionally, $\sum_{i=1}^{2p-1} c_i = 1$.

We can simplify the equations beyond this. Consider equations (1.1) to (1.4). Each is parameterized for
a certain set of values of $i$. Let the equation for a particular $i$ be denoted by $e_i$. Add $e_1$ and $e_2$ to get:

\[ 0 = 3a_3 - c_2 - a_1 \]

Adding $e_3$ to this, we get

\[ 0 = 4a_4 - c_3 - a_1 \]

This form continues up to $e_{p-1}$:

\[ 0 = p\,a_p - c_{p-1} - a_1 \]

Adding $e_p$, we get

\[ 0 = (p+1)a_{p+1} - c_p + 2c_{2p-1} - a_1 \]

This form continues until $e_{2p-2}$:

\[ 0 = (2p-1)a_{2p-1} - c_{2p-2} + 2c_{2p-1} - a_1 \]

Adding $e_{2p-1}$, we get:

\[ a_1 = c_{2p-1} \]


Substituting $c_{2p-1}$ for $a_1$ in the previous equations, we get:

\[ a_1 = c_{2p-1} \]

\[ i a_i = c_{i-1} + c_{2p-1}, \qquad 2 \le i \le p \]

\[ i a_i = c_{i-1} - c_{2p-1}, \qquad p+1 \le i \le 2p-1 \]
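These relations can be checked mechanically: for any vector of ghost proportions, the node proportions built from them should make every one of the equations (1.1) to (1.4) vanish. The sketch below does this with 1-indexed lists; the function names are hypothetical, p >= 2 is assumed, and the test vector is arbitrary rather than an actual equilibrium (so some a_i may come out negative, which does not affect the algebraic identity).

```python
import random


def a_from_c(p, c):
    """Map ghost proportions c[1..2p-1] to node proportions a[1..2p-1].

    Both lists are 1-indexed with an unused entry at index 0; requires p >= 2.
    """
    last = 2 * p - 1
    a = [0.0] * (last + 1)
    a[1] = c[last]
    for i in range(2, p + 1):
        a[i] = (c[i - 1] + c[last]) / i
    for i in range(p + 1, last + 1):
        a[i] = (c[i - 1] - c[last]) / i
    return a


def residuals(p, a, c):
    """Right-hand sides of equations (1.1)-(1.4) for i = 1..2p-1."""
    last = 2 * p - 1
    out = []
    for i in range(1, last + 1):
        r = -i * a[i] - c[i]
        if i < last:
            r += (i + 1) * a[i + 1]   # absent for i = 2p-1, equation (1.3)
        if i > 1:
            r += c[i - 1]             # absent for i = 1, equation (1.1)
        if i == p:
            r += 2 * c[last]          # split term of equation (1.2)
        out.append(r)
    return out


if __name__ == "__main__":
    rng = random.Random(1)
    p = 6
    c = [0.0] + [rng.random() for _ in range(2 * p - 1)]
    a = a_from_c(p, c)
    print(max(abs(r) for r in residuals(p, a, c)))
```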

Examining the equations for the $d_i$, we see that $d_i \to c_i$ as $N \to \infty$. Let:

\[ v = \sum_{i=1}^{2p-1} a_i = \sum_{i=1}^{2p-2} \frac{c_i}{i+1} + \bigl[2H(p) - H(2p-1)\bigr]\,c_{2p-1} \]

Using this and the equations for the $a_i$, we get:
Theorem 2    The values of $c_i$, $1 \le i \le 2p-1$, satisfy the following system of non-linear equations:

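\[ 0 = -c_1 + 2c_2 + \frac{c_{2p-1}c_1}{v} - \frac{c_1^2}{c_{2p-1}}, \qquad i = 1 \]

\[ 0 = -i c_i + (i+1)c_{i+1} + \frac{(c_{i-1} + c_{2p-1})c_1}{iv} - \frac{i c_i^2}{c_{i-1} + c_{2p-1}} + \frac{(i-1)c_{i-1}^2}{c_{i-2} + c_{2p-1}}, \qquad 2 \le i < p \]

\[ 0 = -p c_p + (p+1)c_{p+1} + \frac{(c_{p-1} + c_{2p-1})c_1}{pv} - \frac{p c_p^2}{c_{p-1} + c_{2p-1}} + \frac{(p-1)c_{p-1}^2}{c_{p-2} + c_{2p-1}} + \frac{(2p-1)c_{2p-1}^2}{c_{2p-2} - c_{2p-1}}, \qquad i = p \]

\[ 0 = -i c_i + (i+1)c_{i+1} + \frac{(c_{i-1} - c_{2p-1})c_1}{iv} - \frac{i c_i^2}{c_{i-1} - c_{2p-1}} + \frac{(i-1)c_{i-1}^2}{c_{i-2} + c_{2p-1}}, \qquad i = p+1 \]

\[ 0 = -i c_i + (i+1)c_{i+1} + \frac{(c_{i-1} - c_{2p-1})c_1}{iv} - \frac{i c_i^2}{c_{i-1} - c_{2p-1}} + \frac{(i-1)c_{i-1}^2}{c_{i-2} - c_{2p-1}}, \qquad p+1 < i < 2p-1 \]

\[ 0 = -(2p-1)c_{2p-1} + \frac{(c_{2p-2} - c_{2p-1})c_1}{(2p-1)v} - \frac{(2p-1)c_{2p-1}^2}{c_{2p-2} - c_{2p-1}} + \frac{(2p-2)c_{2p-2}^2}{c_{2p-3} - c_{2p-1}}, \qquad i = 2p-1 \]

(Here $c_0$ is taken to be 0, so that the last term for $i = 2$ is $c_1^2/c_{2p-1}$.)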

The equations for the $\Delta c_i$ are linearly dependent; they sum to 0. Remove one of the equations to
get $2p-1$ equations in $2p-1$ variables, and the system is ready to be solved by a non-linear equation
solving package. The package we used is NAG [NAG].

Three simulation experiments were run. A B-tree was built using insert operations, and then a long
sequence of modify operations was performed. The experiments were run for p = 5, 10, 15, 20. Table 2
compares the utilization, $U$, and probability of splitting, $P_s$, from the simulation and the ghost model.
The calculated values for the utilization are within 10% of the utilization observed in the simulation.
The calculated and observed probabilities of splitting diverge, but they both decrease exponentially with
p. Figure 1 compares the distribution of the $a_i$. Figure 2 shows the calculated utilization for p between
5 and 25. In the model as well as the simulation, the utilization approaches 39%, and the probability of
splitting or deleting a node decreases exponentially with p.

2.3     Parameterized  Inserts  and  Deletes 

An  intermediate  operation  mix  between  pure-insert  and  pure-modify  is  the  case  where  there  are  deletes  in 
the  operation  mix,  but  there  are  more  inserts  than  deletes.  We  will  call  this  operation  mix  parameterized 
by q. Let us define:

$L$: number of operations performed.

$q$: probability that a given operation is a delete.

Therefore, an insert will occur with probability $1 - q$ and the expected number of items in the tree
after $L$ operations will be $L(1 - 2q)$.


2.3.1      Ghost  Model 

Because there are deletes in the instruction mix, but there are more inserts than deletes, both the number
of items and the number of ghosts in the leaves will grow with $L$. The expected number of items in the
tree will be $L(1 - 2q)$ after $L$ operations, and the expected number of ghosts will be $Lq$, so that the total
expected number of ghosts and items will be $L(1 - q)$. Therefore $C_i = Lqc_i$ and $A_i = L(1 - 2q)a_i$. The
actual number of items and ghosts in the B-tree will have a binomial distribution, but by the law of large
numbers, the deviation will become negligible compared to the mean.
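The expected counts are easy to confirm with a toy experiment that tracks only the two totals and ignores the tree structure entirely. This is a minimal sketch; the function name is hypothetical, and it assumes that a delete drawn while no items exist is simply skipped (which is rare when q < 0.5).

```python
import random


def simulate_counts(n_ops, q, seed=0):
    """Count live items and ghosts after n_ops operations with delete probability q.

    Assumption: a delete drawn while the tree is empty is skipped.
    """
    rng = random.Random(seed)
    items = ghosts = 0
    for _ in range(n_ops):
        if rng.random() < q:
            if items:
                items -= 1   # the deleted item ...
                ghosts += 1  # ... becomes a ghost
        else:
            items += 1
    return items, ghosts


if __name__ == "__main__":
    L, q = 200000, 0.3
    items, ghosts = simulate_counts(L, q)
    print(items / L, "vs", 1 - 2 * q)
    print(ghosts / L, "vs", q)
```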
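The equations that describe the $A_i$, $1 < i < 2p-1$ and $i \ne p$, are:

\[ A_i(L+1) = A_i(L) - iq\,a_i(L) + (i+1)q\,a_{i+1}(L) - q\,c_i(L) - i(1-2q)\,a_i(L) + q\,c_{i-1}(L) + (i-1)(1-2q)\,a_{i-1}(L) \]

\[ (L+1)(1-2q)\,a_i(L+1) = -iq\,a_i(L) + (i+1)q\,a_{i+1}(L) - q\,c_i(L) - i(1-2q)\,a_i(L) + q\,c_{i-1}(L) + (i-1)(1-2q)\,a_{i-1}(L) + L(1-2q)\,a_i(L) \]

\[ (L+1)(1-2q)\,\Delta a_i(L+1) = -\bigl(i(1-q) + (1-2q)\bigr)a_i - q\,c_i + (i+1)q\,a_{i+1} + q\,c_{i-1} + (i-1)(1-2q)\,a_{i-1} \]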

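The equations that describe the $C_i$, $1 < i < 2p-1$ and $i \ne p$, are:

\[ (L+1)q\,c_i(L+1) = Lq\,c_i(L) - iq\,a_i(L)\,\frac{q\,c_i(L)}{(1-2q)a_i(L)} + (i+1)q\,a_{i+1}(L)\left(\frac{q\,c_{i+1}(L)}{(1-2q)a_{i+1}(L)} + 1\right) - \bigl(q\,c_i(L) + i(1-2q)a_i(L)\bigr)\frac{q\,c_i(L)}{(1-2q)a_i(L)} + \bigl(q\,c_{i-1}(L) + (i-1)(1-2q)a_{i-1}(L)\bigr)\frac{q\,c_{i-1}(L)}{(1-2q)a_{i-1}(L)} + q\,a_1(L)\,\frac{a_i(L)}{\sum_j a_j(L)}\left(\frac{q\,c_1(L)}{(1-2q)a_1(L)} + 1\right) \]

\[ (L+1)q\,\Delta c_i = -c_i\left(\frac{iq^2}{1-2q} + \frac{q^2 c_i}{(1-2q)a_i} + iq + q\right) + \frac{(i+1)q^2 c_{i+1}}{1-2q} + c_{i-1}\left(\frac{q^2 c_{i-1}}{(1-2q)a_{i-1}} + (i-1)q\right) + (i+1)q\,a_{i+1} + \frac{q\,a_i}{\sum_j a_j}\left(\frac{q\,c_1}{1-2q} + a_1\right) \]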


Let $v = \sum_i a_i$. After taking into account the special cases at $i = 1$, $i = p$ and $i = 2p-1$, and
performing algebraic manipulation, we get:

Theorem 3   For a merge-at-empty B-tree that is parameterized by $q$, the steady state values of $a_i$ and $c_i$,
$1 \le i \le 2p-1$, are the solution to the following set of non-linear equations:


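For the $a_i$:

\[ 0 = -\bigl(i(1-q) + (1-2q)\bigr)a_i - q\,c_i + (i+1)q\,a_{i+1}, \qquad i = 1 \]

\[ 0 = -\bigl(i(1-q) + (1-2q)\bigr)a_i - q\,c_i + (i+1)q\,a_{i+1} + q\,c_{i-1} + (i-1)(1-2q)\,a_{i-1} + 2q\,c_{2p-1} + 2(2p-1)(1-2q)\,a_{2p-1}, \qquad i = p \]

\[ 0 = -\bigl(i(1-q) + (1-2q)\bigr)a_i - q\,c_i + q\,c_{i-1} + (i-1)(1-2q)\,a_{i-1}, \qquad i = 2p-1 \]

\[ 0 = -\bigl(i(1-q) + (1-2q)\bigr)a_i - q\,c_i + (i+1)q\,a_{i+1} + q\,c_{i-1} + (i-1)(1-2q)\,a_{i-1}, \qquad \text{otherwise} \]

For the $c_i$:

\[ 0 = -c_i\left(iq + \frac{q\,c_i}{a_i} + (1-2q)(i+1)\right) + (i+1)q\,c_{i+1} + \frac{a_i}{v}\bigl(q\,c_1 + (1-2q)a_1\bigr) + (1-2q)(i+1)\,a_{i+1}, \qquad i = 1 \]

\[ 0 = -c_i\left(iq + \frac{q\,c_i}{a_i} + (1-2q)(i+1)\right) + (i+1)q\,c_{i+1} + c_{i-1}\left(\frac{q\,c_{i-1}}{a_{i-1}} + (1-2q)(i-1)\right) + \frac{a_i}{v}\bigl(q\,c_1 + (1-2q)a_1\bigr) + (1-2q)(i+1)\,a_{i+1} + c_{2p-1}\left(\frac{q\,c_{2p-1}}{a_{2p-1}} + (1-2q)(2p-1)\right), \qquad i = p \]

\[ 0 = -c_i\left(iq + \frac{q\,c_i}{a_i} + (1-2q)(i+1)\right) + c_{i-1}\left(\frac{q\,c_{i-1}}{a_{i-1}} + (1-2q)(2p-2)\right) + \frac{a_i}{v}\bigl(q\,c_1 + (1-2q)a_1\bigr), \qquad i = 2p-1 \]

\[ 0 = -c_i\left(iq + \frac{q\,c_i}{a_i} + (1-2q)(i+1)\right) + (i+1)q\,c_{i+1} + c_{i-1}\left(\frac{q\,c_{i-1}}{a_{i-1}} + (1-2q)(i-1)\right) + \frac{a_i}{v}\bigl(q\,c_1 + (1-2q)a_1\bigr) + (1-2q)(i+1)\,a_{i+1}, \qquad \text{otherwise} \]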

\[ 0 = 1 - \sum_{i=1}^{2p-1} i a_i \]

The equations for the equilibrium point of the $\Delta c_i$ are linearly dependent, as again they sum to zero.
In order to solve the system, remove one of the $\Delta c_i$ equations, which gives $4p-2$ equations in $4p-2$
variables.

Note that if $q = .5$, then we have the pure modify model, and if $q = 0$, then we have the pure insert
model. Thus this model generalizes both the pure-modify ghost model and Yao's pure-insert model.
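Both special cases can be read off the general equation for the $a_i$ (the "otherwise" case of Theorem 3). The worked substitution below is an illustration added for clarity; it uses no symbols beyond those already defined.

```latex
% General equation for a_i (1 < i < 2p-1, i \ne p):
\[ 0 = -\bigl(i(1-q) + (1-2q)\bigr)a_i - q\,c_i + (i+1)q\,a_{i+1} + q\,c_{i-1} + (i-1)(1-2q)\,a_{i-1} \]
% q = 0: every ghost and delete term vanishes, leaving the pure-insert equation.
\[ 0 = -(i+1)\,a_i + (i-1)\,a_{i-1} \]
% q = 1/2: the factor (1-2q) vanishes; multiplying by 2 gives equation (1.4).
\[ 0 = -i\,a_i + (i+1)\,a_{i+1} - c_i + c_{i-1} \]
```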

If we add together the equations for the equilibrium points of the $\Delta a_i$, we get the relation:

\[ 0 = -(1-2q)\sum_{i=1}^{2p-1} a_i - q\,a_1 + (1-2q)(2p-1)\,a_{2p-1} + q\,c_{2p-1} \]

The formula can be interpreted as follows: On a delete operation, the probability of merging a node is
$a_1$. Since $q$ is the probability of an operation being a delete, $q a_1 = R_d$ = the rate at which nodes are
merged, in nodes per operation. Similarly, $(1-2q)(2p-1)a_{2p-1} + q c_{2p-1} = R_s$ = the rate at which
nodes split. If we let $R_s - R_d = R_g$ = the rate at which nodes are added to the B-tree, and we recall the
relationship between $U$ and $\sum a_i$, we have the relationship:

\[ U = \frac{1-2q}{(2p-1)\,R_g} \]

As we shall see in the next section, $U$ remains stable over a wide range of $q$ if $p$ is large. Since $R_d$ is
negligible compared to $R_s$ if $q < .5$, the relationship can be used as a rule of thumb to calculate the
probability of splitting.
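Read as a formula, the rule of thumb says that the split rate per operation is roughly (1 - 2q) / ((2p - 1) U) once the merge rate is neglected. The sketch below evaluates it; the function names are hypothetical, and the utilization passed in for q > 0 is an input the caller must supply (for example from a simulation), not something this snippet computes. At q = 0, with U taken from Corollary 2, the rule reproduces the split probability of Corollary 1 exactly.

```python
def harmonic(n):
    """H(n) = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / k for k in range(1, n + 1))


def yao_utilization(p):
    """Pure-insert utilization from Corollary 2."""
    return 2 * p * (harmonic(2 * p) - harmonic(p)) / (2 * p - 1)


def split_rate_rule_of_thumb(p, q, utilization):
    """Approximate splits per operation: R_s ~ R_g = (1 - 2q) / ((2p - 1) U)."""
    return (1 - 2 * q) / ((2 * p - 1) * utilization)


if __name__ == "__main__":
    p = 10
    print(split_rate_rule_of_thumb(p, 0.0, yao_utilization(p)))
    # Illustrative only: 0.65 is a made-up utilization, not a result from the paper.
    print(split_rate_rule_of_thumb(p, 0.4, 0.65))
```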

The  equations  were  solved  for  q  =  .45  and  q  =  .47,  and  p  varying  between  5  and  20.  Table  2  shows  a 
comparison  between  analytical  results  and  simulation  results. 

Unfortunately, these equations are difficult to solve, as non-linear equation solvers require a good
estimate of the starting point. We calculated the solutions for only a few cases. This points out the need
for an approximation that can be easily solved.

2.3.2      Linear  Model 

The difficulty in solving the ghost model of a B-tree is that the ghosts satisfy a non-linear recurrence. Ghosts are necessary when the number of ghosts in a node outnumbers the number of items. If q is small, however, then the number of items in a node is greater than the number of ghosts in a node. In addition, the distribution of the ghosts among the nodes of different order is similar to the distribution of the items. Therefore, we can try making the approximation that the probability that a node of order i receives an insert is $i\,a_i$. Using this approximation gives us linear equations, so we will call this model the linear model. Mizoguchi described a similar model for B-trees with even maximum node size [Mizoguchi 79]. In this section, we will describe the linear model using the methods that we have developed, examine its range of accuracy, then examine the results on B-tree utilization.
The linear model is described by the following recurrence:


$$A_i(L+1) = q\left[\left(1 - \frac{i}{L(1-2q)}\right)A_i(L) + \frac{i+1}{L(1-2q)}\,A_{i+1}(L)\right] + \cdots$$

If we then solve for $\Delta P_i$, where $P_i = i\,a_i$, we get:

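Theorem 4  Under the approximation that the probability that a node of order i receives an insert is $i\,a_i$, the $P_i$ are the solution to the following set of linear equations:

$$\begin{aligned}
0 &= -(1 + (1-2q))\,P_1 + q\,P_2 && i = 1 \\
0 &= -(p + (1-2q))\,P_p + (1-q)\,p\,P_{p-1} + q\,p\,P_{p+1} + 2p(1-q)\,P_{2p-1} && i = p \\
0 &= -(2p-1 + (1-2q))\,P_{2p-1} + (1-q)(2p-1)\,P_{2p-2} && i = 2p-1 \\
0 &= -(i + (1-2q))\,P_i + (1-q)\,i\,P_{i-1} + q\,i\,P_{i+1} && \text{otherwise} \\
0 &= 1 - \sum_{i=1}^{2p-1} P_i
\end{aligned}$$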

The  existence,  uniqueness  and  stability  of  the  equilibrium  point  holds  by  the  previous  argument. 
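As a minimal sketch of how the system in Theorem 4 might be solved, the following Python fragment builds the equilibrium equations, replaces one of them with the normalization (the equilibrium equations sum to zero, so one is redundant), and solves the result exactly with rational arithmetic. The names `solve_linear_model` and `utilization` are illustrative and not from the paper. The utilization is computed under the assumption that U = 1/((2p - 1) Σ a_i), with a_i = P_i / i and a node capacity of 2p - 1 items.

```python
from fractions import Fraction


def solve_linear_model(p, q):
    """Return [P_1, ..., P_{2p-1}] for the linear model (illustrative helper)."""
    if p < 2:
        raise ValueError("p must be at least 2")
    q = Fraction(q)
    n = 2 * p - 1
    # rows[i-1] holds the equilibrium equation for order i; last column is the RHS.
    rows = [[Fraction(0)] * (n + 1) for _ in range(n)]
    for i in range(1, n + 1):
        row = rows[i - 1]
        row[i - 1] -= i + (1 - 2 * q)        # -(i + (1 - 2q)) P_i
        if i > 1:
            row[i - 2] += (1 - q) * i        # (1 - q) i P_{i-1}
        if i < n:
            row[i] += q * i                  # q i P_{i+1}
        if i == p:
            row[n - 1] += 2 * p * (1 - q)    # two order-p nodes from each split
    # Replace the (redundant) last equation by the normalization sum(P_i) = 1.
    rows[n - 1] = [Fraction(1)] * (n + 1)
    # Gauss-Jordan elimination with exact fractions.
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        inv = 1 / rows[col][col]
        rows[col] = [x * inv for x in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                f = rows[r][col]
                rows[r] = [x - f * y for x, y in zip(rows[r], rows[col])]
    return [rows[i][n] for i in range(n)]


def utilization(p, P):
    """Assumed relation: U = 1 / ((2p - 1) * sum(a_i)) with a_i = P_i / i."""
    nodes_per_item = sum(P_i / i for i, P_i in enumerate(P, start=1))
    return 1 / ((2 * p - 1) * nodes_per_item)
```

With q = 0 the solution reduces to the pure-insert case, in which $P_i$ vanishes for i < p and is proportional to 1/(i + 1) for i ≥ p.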

Tables 4, 5 and 6 show a comparison between the simulation results and the analytic results. Examining the tables, we see that the linear approximation is very accurate for q < .4. At q = .45, the ghost model gives results that more closely match those of the simulation. The case of q = .5 gives the merge-at-empty pure-modify models of [Mizoguchi 79] and [Quitzow 80]. The $a_i$ calculated from the model for q = .5 is plotted along with the simulation and ghost model results in Figure 1. As can be seen, the linear model does not calculate the distribution of nodes for q = .5.

If q < .5, however, the linear model becomes more accurate as p increases. In other words, when inserts significantly outnumber deletes and the maximum node size increases, the linear model becomes more accurate. This happens for two reasons: as p increases, the distribution of the $c_i$ becomes more similar to the distribution of the $i\,a_i$; also, there are fewer very small nodes, the nodes for which $i\,a_i$ is the poorest approximation to $c_i$.

Figure 3 shows a comparison between the ghost model, the linear model and the simulation. Notice that the linear model always gives low estimates of the utilization, even for the pure-modify case. The low estimates of the utilization occur because the linear model underestimates the probability that a small node receives an insert and thus becomes larger (a small node must have received many deletes). This means that we can use the linear model to provide a lower bound on the utilization of a random B-tree.

Figure 4 shows how the utilization varies with p for different q. Notice that as p becomes large, the
utilization  curve  becomes  flatter  with  a  sharper  knee  near  q  =  .5.  If  we  examine  the  entries  in  Table  5 
for  p  =  100,  we  see  that  the  linear  model  predicts  a  utilization  of  65.48%  even  when  q  =  .47,  which  is  a 
lower  bound  by  the  previous  paragraph. 

While the linear model gives reasonable estimates of the space utilization, it gives very poor estimates of the probability of splitting or merging when q becomes close to .5. The linear model predicts a quadratic decrease in $P_s$ and $P_m$ as p becomes larger for pure-modify operations, but the ghost model and the simulation show an exponential decrease. Since the linear model underestimates the probability that a small node receives an insert, the linear model overestimates the probability of merging on a delete (the simulation and the ghost model predict that $P_m$ is almost zero if q < .5). Because the predicted utilization is close to the actual utilization, the rule of thumb requires that the overestimate of merges be balanced by an overestimate of splits. However, if q is small enough (q < .4), then the linear model gives good estimates of $P_m$ and $P_s$.

3     Merge-at-Half  B-trees 

We next analyze merge-at-half B-trees. The analysis of merge-at-half B-trees is much more difficult than the analysis of merge-at-empty B-trees because a half-empty node may merge with any other node. Therefore, we now present only an approximate analysis for the purpose of comparing to merge-at-empty B-trees. The ghost model of merge-at-half B-trees is presented in the appendix. The following analysis is similar to the analysis of [Quitzow 80], which models the B-tree nodes using differential equations. However, [Quitzow 80] examines only the cases of pure-insert and pure-modify. For a comparison, we need to know the utilization for the entire range of q, because merges might (and in fact do) cause the utilization to increase when there are deletes in the instruction mix. Also, [Quitzow 80] does not calculate the restructuring probabilities.

3.1      Analysis 

The first step in the analysis is to specify how the tree is modified on an insert or a delete. The action on an insert is the same as that for a merge-at-empty B-tree.

$$(X_p, \ldots, X_i, X_{i+1}, \ldots, X_{2p-1}) \rightarrow (X_p, \ldots, X_i - 1, X_{i+1} + 1, \ldots, X_{2p-1})$$

$$(X_p, \ldots, X_{2p-1}) \rightarrow (X_p + 2, \ldots, X_{2p-1} - 1)$$

The action on a delete is also the same as that for a merge-at-empty B-tree if no node gets merged (no delete from an order p node).

$$(X_p, \ldots, X_i, X_{i+1}, \ldots, X_{2p-1}) \rightarrow (X_p, \ldots, X_i + 1, X_{i+1} - 1, \ldots, X_{2p-1})$$

If a delete is directed to an order p node, then the node gets merged with its neighbor. Thus the action of a delete that causes a merge depends on the sibling of the merged node. Suppose that the sibling is an order j node. Then

$$(X_p, \ldots, X_{2p-1}) \rightarrow (X_p - 2, \ldots, X_{2p-1} + 1) \qquad j = p$$

$$(X_p, \ldots, X_{(p+j-1)/2}, \ldots, X_j, \ldots, X_{2p-1}) \rightarrow (X_p - 1, \ldots, X_{(p+j-1)/2} + 2, \ldots, X_j - 1, \ldots, X_{2p-1}) \qquad p + j - 1 \text{ is even}$$

$$(X_p, \ldots, X_{(p+j)/2-1}, X_{(p+j)/2}, \ldots, X_j, \ldots, X_{2p-1}) \rightarrow (X_p - 1, \ldots, X_{(p+j)/2-1} + 1, X_{(p+j)/2} + 1, \ldots, X_j - 1, \ldots, X_{2p-1}) \qquad p + j - 1 \text{ is odd}$$

Note  that  not  every  node  order  has  the  chance  of  gaining  on  a  merge  operation.  In  particular,  if  p 
is  even,  then  (3p  —  2)/2  is  the  largest  order  of  nodes  that  will  gain;  if  p  is  odd,  then  (3p  —  l)/2  is  the 
largest  order  of  nodes  that  will  gain. 
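The merge rule described above can be sketched as a small Python function. This is only an illustration of the bookkeeping, not code from the paper; the name `merge_at_half` is hypothetical. It assumes node orders range from p to 2p - 1, that the merging node has just dropped to p - 1 items, and that the items are shared as evenly as possible with an order j sibling unless j = p.

```python
def merge_at_half(p, j):
    """Orders of the node(s) left after an order-p node loses an item
    and merges with its order-j sibling (illustrative helper)."""
    if not p <= j <= 2 * p - 1:
        raise ValueError("sibling order must lie in [p, 2p - 1]")
    if j == p:
        # (p - 1) + p items fit in a single node of order 2p - 1.
        return (2 * p - 1,)
    total = (p - 1) + j
    low = total // 2
    # Even total: two equal nodes. Odd total: orders differ by one.
    return (low, total - low)
```

For example, with p = 4 a sibling of order 7 yields two nodes of order 5 = (3p - 2)/2, and with p = 5 a sibling of order 9 yields nodes of order 6 and 7 = (3p - 1)/2, matching the largest gaining orders stated above.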

Since the structure of the evolution of the B-tree is different if p is odd or even, let us examine first the case when p is even. An order i node will be merged with its neighbor if its neighbor is of order p and receives a delete (assume a node merges with its right neighbor, except if the node is the rightmost node, in which case it merges with its left neighbor). The probability that a node of order i is the neighbor of the merging node is $A_i(L)/V(L)$. The result of the merge operation is two nodes of order (i + p - 1)/2, or a node of order (i + p)/2 and a node of order (i + p - 2)/2. Therefore, one node of order j will be created if a node of order 2j - p or a node of order 2j - p + 2 is merged, and two nodes of order j will be created if a node of order 2j - p + 1 is merged. If j = (3p - 2)/2, then 2j - p + 2 = 2p, so nodes of order (3p - 2)/2 can't be created from merges involving nodes of order 2j - p + 2. If the neighboring node involved in the merge is of order p, then a node of order 2p - 1 is created. So, if j = p, then nodes of order j aren't created from merges with nodes of order 2j - p = p.

In order to simplify the analysis, approximate the probability that an insert is directed to an order i node by

$$\frac{i\,A_i(L)}{L(1-2q)}$$

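Using the approximation to the probability of inserting at an order i node, we have:

Theorem 5  If p is even and we use the linear approximation, then in a merge-at-half B-tree the $a_i$, $p \le i \le 2p-1$, satisfy the following set of nonlinear equations:

$$\begin{aligned}
0 &= -(p + (1-2q))\,a_p + q(p+1)\,a_{p+1} + 2(1-q)(2p-1)\,a_{2p-1} \\
  &\quad + \frac{q\,p\,a_p}{\sum_j a_j}\left(-a_p + 2a_{p+1} + a_{p+2}\right) && i = p \\
0 &= -(i + (1-2q))\,a_i + q(i+1)\,a_{i+1} + (1-q)(i-1)\,a_{i-1} \\
  &\quad + \frac{q\,p\,a_p}{\sum_j a_j}\left(-a_i + a_{2i-p} + 2a_{2i-p+1} + a_{2i-p+2}\right) && p < i < \tfrac{3p-2}{2} \\
0 &= -(i + (1-2q))\,a_i + q(i+1)\,a_{i+1} + (1-q)(i-1)\,a_{i-1} \\
  &\quad + \frac{q\,p\,a_p}{\sum_j a_j}\left(-a_i + a_{2i-p} + 2a_{2i-p+1}\right) && i = \tfrac{3p-2}{2} \\
0 &= -(i + (1-2q))\,a_i + q(i+1)\,a_{i+1} + (1-q)(i-1)\,a_{i-1} \\
  &\quad - \frac{q\,p\,a_p}{\sum_j a_j}\,a_i && \tfrac{3p}{2} \le i < 2p-1 \\
0 &= -(i + (1-2q))\,a_i + (1-q)(i-1)\,a_{i-1} \\
  &\quad + \frac{q\,p\,a_p}{\sum_j a_j}\left(-a_{2p-1} + a_p\right) && i = 2p-1 \\
0 &= 1 - \sum_{i=p}^{2p-1} i\,a_i
\end{aligned}$$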

When p is odd, the equations that describe the $a_i$ are the same as when p is even, except that a node of order (3p - 1)/2 is the highest-order node that can gain on a merge, and only from an order 2j - p node:

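Theorem 6  If p is odd and we use the linear approximation, then in a merge-at-half B-tree the $a_i$, $p \le i \le 2p-1$, satisfy the following set of nonlinear equations:

$$\begin{aligned}
0 &= -(p + (1-2q))\,a_p + q(p+1)\,a_{p+1} + 2(1-q)(2p-1)\,a_{2p-1} \\
  &\quad + \frac{q\,p\,a_p}{\sum_j a_j}\left(-a_p + 2a_{p+1} + a_{p+2}\right) && i = p \\
0 &= -(i + (1-2q))\,a_i + q(i+1)\,a_{i+1} + (1-q)(i-1)\,a_{i-1} \\
  &\quad + \frac{q\,p\,a_p}{\sum_j a_j}\left(-a_i + a_{2i-p} + 2a_{2i-p+1} + a_{2i-p+2}\right) && p < i < \tfrac{3p-1}{2} \\
0 &= -(i + (1-2q))\,a_i + q(i+1)\,a_{i+1} + (1-q)(i-1)\,a_{i-1} \\
  &\quad + \frac{q\,p\,a_p}{\sum_j a_j}\left(-a_i + a_{2i-p}\right) && i = \tfrac{3p-1}{2} \\
0 &= -(i + (1-2q))\,a_i + q(i+1)\,a_{i+1} + (1-q)(i-1)\,a_{i-1} \\
  &\quad - \frac{q\,p\,a_p}{\sum_j a_j}\,a_i && \tfrac{3p-1}{2} < i < 2p-1 \\
0 &= -(i + (1-2q))\,a_i + (1-q)(i-1)\,a_{i-1} \\
  &\quad + \frac{q\,p\,a_p}{\sum_j a_j}\left(-a_i + a_p\right) && i = 2p-1 \\
0 &= 1 - \sum_{i=p}^{2p-1} i\,a_i
\end{aligned}$$

The $a_i$ are also subject to the condition that $\sum i\,a_i = 1$. We solved the above set of non-linear equations with the numerical analysis package NAG [NAG].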

3.2      Comparison 

We  modified  the  merge-at-empty  simulator  to  make  a  merge-at-half  B-tree  and  performed  experiments 
to  compare  against  the  results  of  our  analysis.  In  tables  8,  9  and  10,  we  compare  the  results  from  the 
analysis  and  the  simulations. 

The utilization predicted by the analysis and by the simulation agree well, although the difference increases as q approaches .5. Surprisingly, the space utilization increases as deletes become more common in the instruction mix. The utilization increases because a merging node will increase in size. However, the increase in utilization is small, as the utilization never goes above 71%.

The  space  utilization  stays  level  for  most  values  of  the  parameter  q,  but  decreases  sharply  as  q 
approaches  .5.  Furthermore,  for  q  =  .5,  the  space  utilization  decreases  as  p  increases.  Both  the  analysis 
and  the  simulation  indicate  that  the  utilization  will  approach  about  60%. 

The analysis and the simulation do not agree as well on the probability of splitting or merging. The difference is small when q = .1, but becomes larger as q increases. The increasing error is due to ignoring the effect of ghosts, whose effect becomes more significant as q increases. The analysis consistently underestimates the probability of splitting or merging.

Next,  we  use  the  results  on  merge-at-empty  B-trees  from  the  previous  section  to  compare  merge-at- 
half  B-trees  against  merge-at-empty  B-trees.  For  the  comparison,  we  used  a  B-tree  with  p  =  40.  In 
Figure  7,  we  compared  the  utilization  of  a  merge-at-half  and  a  merge-at-empty  B-tree.  The  figure  shows 
that  the  utilization  of  both  B-trees  remains  close  for  most  values  of  the  parameter  q.  Up  to  q  =  .45  the 
utilization  of  the  merge-at-half  B-tree  is  less  than  10%  greater  than  the  utilization  of  the  merge-at-empty 
tree, but when q = .5, the difference becomes 60%.

In Figure 8, we compare the probability of restructuring on an operation (= q Pr[merge on delete] + (1 - q) Pr[split on insert]). The probability of restructuring a merge-at-half B-tree is 20% greater when q = .1 and 300% greater when q = .45. When q = .5, the rates cannot be compared because the probability of restructuring a merge-at-empty B-tree becomes infinitesimal, but the probability of restructuring a merge-at-half tree becomes large. Table 11 also compares restructuring rates for several values of the parameter p.

4     Upper  Levels 

The upper-level structure of random B-trees is important for analyzing the performance of concurrent B-tree algorithms. We will analyze two kinds of restructuring algorithms: bottom-up and top-down. The bottom-up algorithm splits a non-leaf node only if it is full and one of its children splits. The top-down algorithm splits any full non-leaf node that an insert encounters on its path to a leaf. At the leaf level, we will assume that the B-tree uses merge-at-empty. Since merges are rare with merge-at-empty, we will ignore them.

A  leaf  will  be  defined  to  be  on  level  1.  The  parent  of  the  leaf  is  on  level  2,  and  the  root  is  on  level  /. 

4.1      Bottom-Up  Algorithm 

In this analysis, we assume that nodes are merged only when they become empty, and that no preparatory operations are performed. We have established that ghosts are necessary for modeling B-trees in the presence of deletes. However, we have also shown that if p is large and q < .5, then the merge rate is vanishingly small compared to the split rate. Therefore, there will be very few second-level ghosts as compared to the number of second-level entries, so we do not need to model second-level ghosts. We will deal with the case q < .5 first, and handle the case q = .5 later.

Let $A_{2,i}(N)$ be the expected number of second-level, order i nodes. Let $E_1(N, q)$ be the expected number of items in a leaf-level node. We have shown that $E_1(N, q)$ is independent of N, so we will drop the dependence on N in the notation from now on. After N modify operations, each of which is a delete with probability q and an insert with probability 1 - q, the expected number of items in the B-tree is N(1 - 2q). Using the same probability distribution as for the leaf level, the probability that an insert is directed to a node is proportional to the number of items and ghosts that the node covers. Since first-level merge operations are rare, deletes on the second level are rare, so ghosts on the second level have negligible effect. If we assume that the structure of the first level is independent of the structure of the second level, then the number of items and ghosts covered by a node is proportional to the size of the node. Therefore, the probability that an insert is directed to an order i, level 2 node is:

$$\Pr[I_{2,i}] = \frac{i\,A_{2,i}(N)\,E_1(q)}{N(1-2q)}$$


A level 2, order i node becomes an order i + 1 node (or becomes two order p nodes) only if a level 1 node splits. The probability that a node splits is $(1-2q)/(E_1(q)(1-q))$. Therefore, the probability that a level 2, order i node gets inserted into on an operation is $i\,A_{2,i}(N)/(N(1-q))$. Since an operation is an insert with probability 1 - q, and propagated deletes are rare, the equations that describe the second-level nodes in a merge-at-empty B-tree are the equations that describe the pure-insert leaves. We can reach the same conclusion for all non-leaf levels in the B-tree by inductively applying the above arguments.

We simulated a B-tree and ran experiments to compare the resulting $a_{2,i}$ against the prediction. The B-tree was given a parameterized mixture of insert and delete operations until it grew to contain 120,000 items. Twelve snapshots of the second level of the B-tree were collected. The snapshots were averaged together to obtain the results in Table 12.

Pure Modify (q = .5)  For the case of q = .5, or pure modify, we have shown that almost no restructuring operations are performed. The size of the B-tree does not increase if q = .5, so the tree must have grown from an instruction mix where q < .5. Since the previous section shows that the structure of the second level of a B-tree is independent of q, except for scaling, the above results hold for the case of q = .5 in the short term. Eventually, all levels of the pure-modify B-tree will have a leaf-level pure-modify structure, but since the probability of merging a node on a delete is less than .00002 if p = 20, and a large number of modifies is needed before the equilibrium state is reached, the upper levels of even pure-modify B-trees will have a pure-insert distribution of node sizes in practice.

4.2     The  Top-Down  Tree 

4.2.1  Second  Level  Analysis 

In a top-down B-tree, preparatory operations are performed on the path from the root to the leaf. An insert operation that comes across a full node splits the node, and a delete operation that comes across a near-empty node merges it. Since propagated merge operations are rare, we will concentrate on the effect of the preparatory split operations.

Since  preparatory  operations  will  split  a  full  node,  the  B-tree  model  must  be  slightly  modified;  a  full 
node  will  split  into  a  node  of  order  p  and  a  node  of  order  p  —  1. 

Let $I_{2,i}$ be the event that an insert operation passes through a level 2, order i node on its path to a leaf. Let $S_{2,i}$ be the event that an insert operation propagates a split to an order i, level 2 node. Then:

$$\Pr[I_{2,i}] = \frac{i\,A_{2,i}(N)\,E_1(q)}{N(1-2q)}$$

For $i = p-1, \ldots, 2p-2$, $\Pr[S_{2,i}] = P_s \Pr[I_{2,i}]$, so

$$\Pr[S_{2,i}] = \frac{1-2q}{1-q}\, i\, a_{2,i}(N) \qquad i = p-1, \ldots, 2p-2$$

Since a full node is split every time an insert operation reaches it,

$$\Pr[S_{2,2p-1}] = (2p-1)\, a_{2,2p-1}(N)\, E_1(q)$$

The recurrence equations that describe the $A_{2,i}$ are:

$$A_{2,i}(N+1) = (1-q)\left(A_{2,i}(N) - (1)\frac{1-2q}{1-q}\, i\, a_{2,i}(N) + (1)\frac{1-2q}{1-q}(i-1)\, a_{2,i-1}(N)\right) + q\left(A_{2,i}(N) - (1)(0) + (1)(0)\right)$$

except for i = p - 1, i = p, and i = 2p - 1. For i = 2p - 1, the recurrence equation is:

$$A_{2,2p-1}(N+1) = (1-q)\left(A_{2,2p-1}(N) - (2p-1)\,E_1(q)\, a_{2,2p-1}(N) + \frac{1-2q}{1-q}(2p-2)\, a_{2,2p-2}(N)\right) + q\,A_{2,2p-1}(N)$$

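After deriving the equations for $a_{2,p-1}$ and $a_{2,p}$ and solving for the equilibrium point, we get the system:

$$\begin{aligned}
0 &= -p\,a_{2,p-1} + \left(\tfrac{1-q}{1-2q}(2p-1)E_1(q)\right) a_{2,2p-1} \\
0 &= -(p+1)\,a_{2,p} + (p-1)\,a_{2,p-1} + \left(\tfrac{1-q}{1-2q}(2p-1)E_1(q)\right) a_{2,2p-1} \\
0 &= -(i+1)\,a_{2,i} + (i-1)\,a_{2,i-1} && i = p+1, \ldots, 2p-2 \\
0 &= -\left[\tfrac{1-q}{1-2q}(2p-1)E_1(q) + 1\right] a_{2,2p-1} + (2p-2)\,a_{2,2p-2}
\end{aligned}$$

In order to make the system of equations more tractable, make the substitution $P_{2,i} = i\,a_{2,i}$:

$$\begin{aligned}
0 &= -p\,P_{2,p-1} + \left(\tfrac{1-q}{1-2q}(p-1)E_1(q)\right) P_{2,2p-1} \\
0 &= -(p+1)\,P_{2,p} + p\,P_{2,p-1} + \left(\tfrac{1-q}{1-2q}\,p\,E_1(q)\right) P_{2,2p-1} \\
0 &= -(i+1)\,P_{2,i} + i\,P_{2,i-1} && i = p+1, \ldots, 2p-2 \\
0 &= -\left[\tfrac{1-q}{1-2q}(2p-1)E_1(q) + 1\right] P_{2,2p-1} + (2p-1)\,P_{2,2p-2}
\end{aligned}$$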

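Let $\alpha = \frac{1-q}{1-2q}(2p-1)E_1(q) + 1$. Then, derive the formulae for the $P_{2,i}$ as in the pure-insert case.

$$P_{2,i} = \frac{\alpha}{i+1}\,P_{2,2p-1} \qquad i = p, \ldots, 2p-2$$

subject to

$$\sum_{i=p-1}^{2p-1} P_{2,i} = 1/E_1(q)$$

Summing the $P_{2,i}$, we get:

$$\begin{aligned}
P_{2,2p-1} &= \left(E_1(q)\left[\left(\tfrac{1-q}{1-2q}(2p-1)E_1(q) + 1\right)\left(H(2p-1) - H(p)\right) + 1 + \tfrac{(1-q)(p-1)}{(1-2q)\,p}E_1(q)\right]\right)^{-1} \\
P_{2,p-1} &= \left(\tfrac{(1-q)(p-1)}{(1-2q)\,p}E_1(q)\right) P_{2,2p-1} \\
P_{2,i} &= \frac{\alpha}{i+1}\,P_{2,2p-1} \qquad i = p, \ldots, 2p-2
\end{aligned}$$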
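The closed-form expressions for the second level of a top-down tree can be evaluated directly. The following Python fragment is a sketch under stated assumptions: the function name `top_down_level2` is hypothetical, `e1` stands for $E_1(q)$ and must be supplied by the caller (for instance from the leaf-level analysis), and H(n) is taken to be the n-th harmonic number.

```python
from fractions import Fraction


def harmonic(n):
    """n-th harmonic number H(n) as an exact fraction."""
    return sum(Fraction(1, k) for k in range(1, n + 1))


def top_down_level2(p, q, e1):
    """Return {i: P_{2,i}} for i = p-1 .. 2p-1 (illustrative helper).

    q must be below 1/2; e1 is the expected number of items per leaf.
    """
    q, e1 = Fraction(q), Fraction(e1)
    r = (1 - q) / (1 - 2 * q)
    alpha = r * (2 * p - 1) * e1 + 1
    tail = r * Fraction(p - 1, p) * e1          # P_{2,p-1} / P_{2,2p-1}
    p_full = 1 / (e1 * (alpha * (harmonic(2 * p - 1) - harmonic(p)) + 1 + tail))
    P = {i: alpha / (i + 1) * p_full for i in range(p, 2 * p - 1)}
    P[p - 1] = tail * p_full
    P[2 * p - 1] = p_full
    return P
```

As a quick check, the returned values sum to 1/`e1`, which is the normalization used in the derivation, and the weight on full nodes, `P[2 * p - 1]`, shrinks as q approaches .5.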

We simulated a B-tree with preparatory operations and ran experiments to compare the resulting $a_{2,i}$ against the prediction. The B-tree was given a parameterized mixture of insert and delete operations until it grew to contain 120,000 items. Twelve snapshots of the second level of the B-tree were collected. The snapshots were averaged together to obtain the results in Table 13.

The values of p and q have an effect primarily on nodes of order 2p - 1. From the formula for $P_{2,2p-1}$, we can see that as q approaches .5, 1/(1 - 2q) becomes very large, so that $P_{2,2p-1}$ becomes very small. Also, $P_{2,2p-1}$ depends on $1/E_1(q)$, so that the ratio of $P_{2,2p-1}$ and $P_{2,i}$ decreases as 1/p. Therefore, if p is large, full upper-level nodes are rare in a top-down B-tree, a desirable characteristic for a concurrent B-tree.

4.2.2      Level 3 and Beyond

The analysis of the third and higher levels of a top-down B-tree proceeds inductively, as the second-level analysis does. We find that:

$$\begin{aligned}
\Pr[S_{3,i}] &= \frac{1-2q}{1-q}\, i\, a_{3,i}(N) && i \ne 2p-1 \\
             &= (2p-1)\, a_{3,2p-1}(N)\, E_1(q) E_2(q) && i = 2p-1
\end{aligned}$$

Thus, the equations that describe the $P_{3,i}$ are the same as those that describe the $P_{2,i}$, except with $E_1(q)E_2(q)$ substituted for every occurrence of $E_1(q)$.

The  space  utilization  of  the  higher  levels  of  a  top-down  tree  can  be  calculated  using  Lemma  2.  Table 
14  shows  the  utilization  of  several  levels  of  a  top-down  tree  for  various  values  of  q,  and  assuming  p  =  30. 

4.3      Root  Analysis 

Two  important  questions  about  the  root  of  the  B-tree  are  what  level  is  the  root  on  and  how  many  children 
does  it  have.  We  will  provide  formulae  for  calculating  these,  but  the  answers  will  depend  on  the  number 
of  items  in  the  B-tree. 

4.3.1      Bottom-Up  Algorithm 

Let  us  begin  by  asking  the  question:  suppose  the  root  is  on  level  h  and  has  i  children.  What  is  the 
expected  number  of  items  in  the  tree? 

Let $N_R(h, i)$ be the number of items covered by an order i, level h root, and let $E_R(h, i) = E[N_R(h, i)]$. From section 4.1.2, a node on level j will hold an expected

$$E'_j = E_j E_{j-1} \cdots E_1 = E^{j-1} E_1$$

items, where

$$E = 2p[H(2p) - H(p)]$$

If we assume that the children of the root have a steady-state distribution of node sizes, then a B-tree with a level h, order i root will hold an expected

$$E_R(h, i) = i\,E^{h-2} E_1$$

items. If, however, i = 2 and you know that the root has recently split, then the expected number of items in the B-tree is $E_R(h-1, 2p)$, because the two children of the root should both be of order p.

Next, we want to turn these predictions around so that we can answer the question: if a B-tree contains N items, what is the level and the order of the root? In order to answer this question, it is instructive to know the variance of $N_R(h, i)$, $V_R(h, i)$. Our starting point is the following formula. Let N be a non-negative integer-valued random variable and let the $X_i$ be independent random variables. Then ([Ross 84]):

$$V\left[\sum_{i=1}^{N} X_i\right] = E[N]\,V[X] + (E[X])^2\,V[N]$$


Let $V'_h$ be the variance of the number of items covered by a level h node. Suppose you knew $E'_{h-1}$ and $V'_{h-1}$. The number of items covered by a level h node is the sum of the number of items covered by its children. The number of children of a level h node is a random variable A(h), where $\Pr[A(h) = i] = a_{h,i}/\sum_j a_{h,j}$. From section 5.2, the expected value and variance of A(h), E and V, are:


$$E = 2p[H(2p) - H(p)]$$

$$V = 2p\left(p - [H(2p) - H(p)]\right) - E^2$$

Therefore, we have the following recursive equations for $E'_h$ and $V'_h$:

$$E'_h = E\,E'_{h-1}$$

$$V'_h = E\,V'_{h-1} + (E'_{h-1})^2\,V$$

Table 15 lists $E'_h$ and $V'_h$ for p = 10 and assuming pure inserts. The variance of the distribution is large compared to the mean.
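The mean and variance of the number of items under a node can be computed level by level by applying the random-sum formula quoted above, with the number of children as N and the items under one child as X. The Python sketch below does this for the pure-insert case. It rests on assumptions that are not spelled out in this section: the steady-state order distribution is taken to be Pr[order = i] = 2p/(i(i + 1)) for i = p, ..., 2p - 1 (the distribution whose mean is 2p[H(2p) - H(p)]), leaves are assumed to follow the same distribution, and the function names are hypothetical.

```python
from fractions import Fraction


def order_moments(p):
    """Mean and variance of the node order under the assumed
    pure-insert steady state Pr[order = i] = 2p / (i (i + 1))."""
    probs = {i: Fraction(2 * p, i * (i + 1)) for i in range(p, 2 * p)}
    mean = sum(i * w for i, w in probs.items())
    second = sum(i * i * w for i, w in probs.items())
    return mean, second - mean * mean


def subtree_moments(p, levels):
    """List of (mean, variance) of items under a node on levels 1..levels."""
    e, v = order_moments(p)
    mean, var = e, v                       # level 1: a leaf
    result = [(mean, var)]
    for _ in range(levels - 1):
        # Random sum: E[sum] = E[N] E[X], V[sum] = E[N] V[X] + E[X]^2 V[N].
        mean, var = e * mean, e * var + mean * mean * v
        result.append((mean, var))
    return result
```

Calling `subtree_moments(10, 3)` gives exact fractions for the first three levels with p = 10; converting them with `float` makes the growth of the variance relative to the mean easy to inspect.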

With this information, we can try to approximate the number of items in the B-tree if the root is on a certain level and of a certain order. By using the mean and variance (and higher moments, if desired), one can compute bounds on the probability that the root is on a certain level and of a certain order. For a rougher estimate, a simpler mechanism can be used: for every (h, i), calculate $E_R(h, i)$. Assign to each (h, i) an interval in a partition of the positive integers by:

$2 < i \le 2p-1$: (h, i) is assigned the interval $[(E_R(h, i-1) + E_R(h, i))/2,\ (E_R(h, i) + E_R(h, i+1))/2]$

$i = 2$: (h, i) is assigned the interval $[(E_R(h-1, 2p-1) + E_R(h-1, 2p))/2,\ (E_R(h, 2) + E_R(h, 3))/2]$

In order to estimate the root height and order, find the interval that N falls in, and use the (h, i) assigned to the interval as the answer. Table 16 compares the (h, i) estimated by this method against root height and order from a simulated B-tree for various values of $E_R(h, i)$.
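The interval lookup described above can be sketched in Python as follows. This is an illustration rather than the authors' procedure: the function names are hypothetical, `e` and `e1` stand for E and $E_1$ and are supplied by the caller, and the sketch assumes that a tree holding fewer than 2p - 1/2 items still has a leaf as its root (a leaf holds at most 2p - 1 items). Because consecutive intervals share their endpoints, it is enough to scan the upper endpoints in increasing order.

```python
from fractions import Fraction


def pure_insert_fanout(p):
    """E = 2p[H(2p) - H(p)], the expected order of a pure-insert node."""
    return 2 * p * sum(Fraction(1, k) for k in range(p + 1, 2 * p + 1))


def expected_items(h, i, e, e1):
    """E_R(h, i) = i * E**(h - 2) * E_1 for a level-h root with i children."""
    return i * e ** (h - 2) * e1


def estimate_root(n_items, p, e, e1, max_level=64):
    """Return the (level, order) whose interval contains n_items."""
    if n_items < 2 * p - 0.5:
        return (1, None)                   # assumption: the root is a leaf
    for h in range(2, max_level + 1):
        for i in range(2, 2 * p):
            upper = (expected_items(h, i, e, e1)
                     + expected_items(h, i + 1, e, e1)) / 2
            if n_items < upper:
                return (h, i)
    raise ValueError("n_items is too large for max_level")
```

For a pure-insert tree one might call `estimate_root(n, p, pure_insert_fanout(p), pure_insert_fanout(p))`, taking the leaf occupancy equal to the fan-out.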

The above analysis assumes that the distribution of node sizes of the children of the root can be closely approximated by the steady-state distribution. However, the children of the root are new enough that the transients in the node size distribution have not died out. The transient distribution of the node sizes of the children of the root is approximated in [Driscoll 87]. Although [Driscoll 87] uses an approximation to the transient distribution, it can more accurately track the size of the B-tree when a child of the root splits. The analysis in [Driscoll 87] assumes a pure-insert B-tree. Since the upper levels of a B-tree with inserts and deletes act like a pure-insert B-tree that grows at the net-insert rate, Driscoll's analysis applies to a parameterized B-tree also.

4.3.2     Top-Down  Root  Analysis 

The root of a top-down tree can be analyzed in the same way as above, except that values of $E_h(q)$ and $V_h(q)$ must be used, and $E_R(h, 2) = E_R(h-1, 2p-1)$ if the root has recently split. Notice that E and V are now dependent on both h and q. Table 17 compares the estimated (h, i) against a simulation for various values of $E_R(h, i)$.

5      Conclusion 

This paper has presented methods to calculate the equilibrium utilization and distribution of nodes in a B-tree when the probability of an operation being a delete ranges between 0 and .5. The random B-tree is modeled by a set of simultaneous recurrence equations. By transforming the recurrence equations into difference equations and searching for the equilibrium point, the problem is transformed into finding the solution of a set of equations.

In  order  to  model  the  situation  when  deletes  are  allowed  into  the  operation  mix,  the  notion  of  a  ghost 
(an  item  that  was  deleted  from  the  B-tree)  was  used.  The  distribution  of  ghosts  in  the  B-tree  affects  the 
input  probability  distribution:  thus  it  must  be  calculated  along  with  the  distribution  of  the  nodes  in  the 
B-tree.  The  ghost  model  achieved  accurate  calculations:  less  than  5%  error  when  q  <  .45  and  p  >  20  in 
calculating  both  the  utilization  and  the  probability  of  splitting  on  an  insert. 

As  q  approaches  .5,  the  distribution  of  nodes  and  ghosts  in  the  B-tree  becomes  harder  to  calculate 
accurately. When q = .47, the error in calculating the utilization and the probability of splitting becomes
10%.  When  q  =  .5,  the  error  in  calculating  the  utilization  is  within  10%,  but  the  analytical  probability  of 
splitting  is  inaccurate.  However,  the  simulation  agrees  with  the  model's  prediction  that  the  probability 
of  splitting  or  merging  is  very  small  and  decreases  exponentially  with  p. 

The  major  problem  with  the  ghost  model  is  that  it  is  hard  to  solve.  An  approximation  to  the  ghost 
model  is  presented  that  reduces  the  problem  of  calculating  the  utilization  and  probability  of  splitting 
and  merging  in  a  random  B-tree  to  solving  a  set  of  simultaneous  linear  equations.  This  model  is  very 
accurate  for  a  wide  range  of  the  parameter  q.  However,  when  q  becomes  larger  than  .4,  the  linear  model 
becomes  inaccurate  and  the  ghost  model  must  be  used. 

5.1      Pragmatic  Conclusions 

The calculations of the utilization of random merge-at-empty B-trees parameterized by p and q show that U remains between 60% and 70% for most values of q (< .45) if p is large (> 15), but drops to 39% when q = .5, that is, if all operations are modifies. Trees in which deletes outnumber inserts are rarely interesting in practice and are difficult to model in the limit. The knee of the utilization curve gets closer to q = .5 and becomes sharper as p becomes larger.

The tendency of the utilization to remain near 69% can be explained by the following arguments. If
there are even just a few more inserts than deletes, the B-tree will grow at the net insert rate (the rate
of inserts minus the rate of deletes). Furthermore, nodes with more items are more likely to get hit by
a delete than smaller nodes, so that 1) it is hard for small nodes to become smaller and 2) larger nodes
tend to bunch up near 69%. These trends can be seen in Figure 6. The curve has the same shape as a
curve for a pure-insert B-tree. The number of nodes with fewer than p items decreases exponentially as p
decreases, and thus has little effect on the expected node size. Since the largest nodes are the most likely
to get hit with a delete, the largest nodes are pushed downwards. Fewer nodes get split, so fewer half-full
nodes appear, and half-full nodes tend to get pushed upwards. These tendencies become stronger as p
increases, causing the utilization to stay near the pure-insert utilization even as q approaches .5.
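
The argument above can be explored with a small Monte Carlo sketch of the leaf level. This is an illustration under stated assumptions, not the simulator used for the tables: the function name and parameters are hypothetical, a delete is assumed to hit a leaf with probability proportional to its item count, an insert is assumed to hit a leaf with probability proportional to its item count plus one, a leaf holding 2p items splits into two leaves of p items, and a leaf is removed only when it becomes empty.

```python
import random


def simulate_leaf_utilization(p, q, operations, seed=0):
    """Return the leaf space utilization after a random insert/delete mix.

    Assumptions (not taken from the paper): deletes pick a leaf with
    probability proportional to its size, inserts with probability
    proportional to size + 1.
    """
    rng = random.Random(seed)
    capacity = 2 * p - 1      # a leaf holds at most 2p - 1 items
    sizes = [p]               # arbitrary start: one leaf with p items
    total = p                 # running count of items over all leaves
    for _ in range(operations):
        if rng.random() < q:  # delete with probability q
            if total == 0:
                continue      # nothing to delete
            k = rng.choices(range(len(sizes)), weights=sizes)[0]
            sizes[k] -= 1
            total -= 1
            if sizes[k] == 0 and len(sizes) > 1:
                # merge-at-empty: drop the leaf once it holds no items
                sizes[k] = sizes[-1]
                sizes.pop()
        else:                 # insert with probability 1 - q
            weights = [s + 1 for s in sizes]
            k = rng.choices(range(len(sizes)), weights=weights)[0]
            sizes[k] += 1
            total += 1
            if sizes[k] > capacity:
                # overflow: split 2p items into two leaves of p items
                sizes[k] = p
                sizes.append(p)
    return total / (capacity * len(sizes))
```

Calling `simulate_leaf_utilization(5, 0.0, 3000)` runs a pure-insert workload; raising q toward .5 lets one watch how slowly the utilization falls away from the pure-insert level.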

The simulation and the ghost model show that the probability of merging a node on a delete in a
merge-at-empty B-tree is almost zero. From the parameterized ghost model, we can relate the utilization
of the B-tree and the probability of splitting on an insert by the formula:

\[ P_s = \frac{1 - 2q}{(1 - q)(2p - 1)\,U} \]


This  follows  from  the  relation  between  the  rate  of  growth  and  the  utilization  derived  in  section  5.1  by 
assuming  that  the  number  of  merges  is  negligible  and  adjusting  the  formula  to  'per  insert'  instead  of  'per 
operation'. 
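
A short worked derivation may make the relation concrete. It is a sketch under the assumptions named in the text (merges are negligible and the tree grows steadily); the symbols I(L) and N(L), for the number of items and the number of leaf nodes after L operations, are introduced here only for illustration.

```latex
% Each operation is an insert with probability 1 - q and a delete with
% probability q, so the expected change in the number of items is
E[\Delta I] = (1 - q) - q = 1 - 2q .
% Neglecting merges, a node is added only when an insert causes a split:
E[\Delta N] = (1 - q) P_s .
% A node holds at most 2p - 1 items, so in steady growth
U (2p - 1) = \frac{I}{N} \approx \frac{E[\Delta I]}{E[\Delta N]}
           = \frac{1 - 2q}{(1 - q) P_s}
\quad\Longrightarrow\quad
P_s = \frac{1 - 2q}{(1 - q)(2p - 1)\,U} .
```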

As it stands, the rule of thumb is not useful unless we have either the utilization or the probability of
splitting available. In Figures 3 and 4, we see that the utilization tends to remain near the pure-insert
level until q is approximately .47, so we can estimate the utilization to be 68% regardless of p and q.
Figure 5 compares the probability of splitting derived from the rule of thumb against the simulation for
varying q with p = 30 and p = 40. As can be seen, the rule of thumb gives very good estimates of the
probability of splitting up to at least q = .45.
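
The rule of thumb is easy to evaluate directly. The following is a minimal sketch; the function name and its default argument are illustrative choices, with the default utilization of 0.68 taken from the estimate given in the text.

```python
def split_probability(p, q, utilization=0.68):
    """Rule-of-thumb probability that an insert splits a leaf.

    p           -- node parameter (a node holds at most 2p - 1 items)
    q           -- fraction of operations that are deletes (0 <= q < 0.5)
    utilization -- space utilization U as a fraction; 0.68 is the
                   estimate suggested in the text
    """
    if not 0 <= q < 0.5:
        raise ValueError("the rule of thumb assumes 0 <= q < 0.5")
    return (1 - 2 * q) / ((1 - q) * (2 * p - 1) * utilization)
```

For example, `split_probability(20, 0.1, 0.6951)`, using the analytical utilization listed in Table 5 for p = 20 and q = 10%, evaluates to about 0.0328, which agrees with the corresponding analytical entry in Table 6.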

The  tendency  of  merge-at-empty  B-tree  utilization  to  remain  near  the  pure-insert  level  suggests 
that  merging  B-tree  nodes  when  they  are  empty  is  not  a  wasteful  strategy  in  terms  of  storage  and  is 
significantly  better  in  terms  of  restructuring. 

The upper levels of a merge-at-empty bottom-up restructuring B-tree have the same distribution of
node sizes that the leaf level of a pure-insert B-tree has, as intuition would suggest. The simple structure
of the upper levels of a merge-at-empty B-tree allows a simple but accurate calculation of the root level
and order.

In  this  paper,  we  also  solved  an  approximate  model  of  a  merge-at-half  B-tree.  The  simplified  model 
seriously  underestimates  the  probability  of  restructuring.  However,  for  the  purpose  of  comparison  to  a 
merge-at-empty  B-tree,  the  underestimation  is  not  a  problem. 

A  merge-at-half  B-tree  will  always  have  a  space  utilization  of  at  least  50%.  When  all  operations 
are  modify  operations,  or  when  the  number  of  insert  operations  is  the  same  as  the  number  of  delete 
operations,  then  the  utilization  will  be  about  60%.  In  contrast,  a  merge-at-empty  B-tree  has  a  0%  lower 
bound on its space utilization, and will have about 39% utilization on a pure-modify instruction mix.
However,  the  space  utilization  of  a  merge-at-empty  B-tree  remains  high  if  there  are  just  a  few  more  insert 
operations  than  delete  operations.  Thus,  merge-at-half  usually  buys  little  in  terms  of  space  utilization. 

In  Figure  8,  we  showed  that  the  restructuring  rate  of  a  merge-at-half  B-tree  is  significantly  larger 
than  the  restructuring  rate  of  a  merge-at-empty  B-tree  for  all  values  of  q  >  .1.  For  many  concurrent 
B-tree  algorithms  used  in  practice  ([Bayer  77]),  restructuring  causes  a  serialization  bottleneck.  Thus,  one 
simple  but  important  way  to  increase  concurrency  in  B-tree  operations  is  to  reduce  the  probability  of 
restructuring. Since merge-at-half buys little space utilization but is expensive in terms of restructuring,
we recommend that B-trees (especially concurrently accessed ones) use merge-at-empty.
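
The restructuring rate compared here is the per-operation quantity q P_m + (1 - q) P_s given in the caption of Table 11. A minimal sketch of that arithmetic follows; the function and argument names are illustrative.

```python
def restructuring_rate(q, p_split, p_merge):
    """Probability that a single operation restructures the tree.

    q       -- fraction of operations that are deletes
    p_split -- probability that an insert causes a split (P_s)
    p_merge -- probability that a delete causes a merge (P_m)
    """
    return q * p_merge + (1 - q) * p_split
```

With the analytical merge-at-half values for p = 20 and q = .1 from Tables 9 and 10, `restructuring_rate(0.1, 0.0331, 0.0644)` gives about 0.0362, the merge-at-half entry listed in Table 11.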
7  Appendix

7.1      Merge-at-Half  Ghost  Model 

In  this  section,  we  will  show  how  to  analyze  a  merge-at-half  B-tree  using  ghosts. 

In  order  to  develop  the  recurrence  equations  that  describe  the  expected  number  of  nodes  of  various 
orders,  we  need  to  know  what  the  expected  change  in  the  node  size  is  on  each  step.  Table  17  lists  the 
possible  changes  in  node  size. 

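gain/loss   probability                                               range                cause
-1          q i A_i(L) / (L(1-2q))                                    i = p, ..., 2p-1     delete
-1          (1-q) (i A_i(L) + C_i(L)) / (L(1-q))                      i = p, ..., 2p-1     insert
-1          q [p A_p(L) / (L(1-2q))] [A_i(L) / V(L)]                  i = p, ..., 2p-1     merge
+1          q (i+1) A_{i+1}(L) / (L(1-2q))                            i = p, ..., 2p-2     delete
+1          (1-q) ((i-1) A_{i-1}(L) + C_{i-1}(L)) / (L(1-q))          i = p+1, ..., 2p-1   insert
+2          (1-q) ((2p-1) A_{2p-1}(L) + C_{2p-1}(L)) / (L(1-q))       i = p                insert
+1          q [p A_p(L) / (L(1-2q))] [A_{2i-p}(L) / V(L)]             i = p+1, ...         merge
+2          q [p A_p(L) / (L(1-2q))] [A_{2i-p+1}(L) / V(L)]           i = p, ...           merge
+1          q [p A_p(L) / (L(1-2q))] [A_{2i-p+2}(L) / V(L)]           i = p, ...           merge
+1          q [p A_p(L) / (L(1-2q))] [A_p(L) / V(L)]                  i = 2p-1             merge

Table 18: Changes in values of A_i on an operation

Using the information listed in Table 17, we can write down the recurrence equation for A_i(L),
p < i < (3p-2)/2:

\[
\begin{aligned}
A_i(L+1) = A_i(L) &- q\,\frac{i A_i(L)}{L(1-2q)} - (1-q)\,\frac{i A_i(L) + C_i(L)}{L(1-q)}
  - q\,\frac{p A_p(L)}{L(1-2q)}\,\frac{A_i(L)}{V(L)} \\
&+ q\,\frac{(i+1) A_{i+1}(L)}{L(1-2q)} + (1-q)\,\frac{(i-1) A_{i-1}(L) + C_{i-1}(L)}{L(1-q)} \\
&+ q\,\frac{p A_p(L)}{L(1-2q)}\,\frac{A_{2i-p}(L) + 2A_{2i-p+1}(L) + A_{2i-p+2}(L)}{V(L)}
\end{aligned}
\]

\[
\begin{aligned}
(L+1)(1-2q)\,a_i(L+1) = {} & L(1-2q)\,a_i(L) - q i\,a_i(L) - q\,c_i(L) - (1-2q) i\,a_i(L)
  - q p\,a_p(L)\,\frac{a_i(L)}{v(L)} \\
& + q(i+1)\,a_{i+1}(L) + q\,c_{i-1}(L) + (i-1)(1-2q)\,a_{i-1}(L) \\
& + \frac{q p\,a_p(L)}{v(L)}\bigl(a_{2i-p}(L) + 2a_{2i-p+1}(L) + a_{2i-p+2}(L)\bigr)
\end{aligned}
\]

\[
\begin{aligned}
(L+1)(1-2q)\,\Delta a_i(L) = {} & -\bigl(i(1-q) + (1-2q)\bigr) a_i(L) \\
& - q\,c_i(L) + q(i+1)\,a_{i+1}(L) + q\,c_{i-1}(L) \\
& + (i-1)(1-2q)\,a_{i-1}(L)
  + \frac{q p\,a_p(L)}{v(L)}\bigl(-a_i(L) + a_{2i-p}(L) + 2a_{2i-p+1}(L) + a_{2i-p+2}(L)\bigr)
\end{aligned}
\]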

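Solving for the equilibrium point, we get, for p < i < (3p-2)/2:

\[
\begin{aligned}
0 = {} & -\bigl(i(1-q) + (1-2q)\bigr) a_i - q\,c_i + q(i+1)\,a_{i+1} + q\,c_{i-1} \\
& + (i-1)(1-2q)\,a_{i-1} + \frac{q p\,a_p}{v}\bigl(-a_i + a_{2i-p} + 2a_{2i-p+1} + a_{2i-p+2}\bigr)
\end{aligned}
\]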

The  equilibrium  point  for  the  other  cases  can  be  found  in  a  similar  manner. 
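
The coefficients 1, 2 and 1 on a_{2i-p}, a_{2i-p+1} and a_{2i-p+2} can be checked with a small sketch of the restructuring step. The rule coded here is an assumption inferred from the recurrence and from the table of changes in A_i, not a statement of the authors' algorithm: an order p node that loses an item is combined with a neighbor, the result stays one node if it fits in 2p - 1 items, and otherwise the items are divided as evenly as possible between two nodes. The function names are hypothetical.

```python
def restructure_at_half(p, neighbor_order):
    """Orders of the nodes left after an order-p node loses one item
    and is combined with a neighbor of the given order (assumed rule)."""
    total = (p - 1) + neighbor_order
    if total <= 2 * p - 1:
        return (total,)              # everything fits in a single node
    small = total // 2
    return (small, total - small)    # share the items as evenly as possible


def nodes_of_order_created(p, i, neighbor_order):
    """How many order-i nodes the restructuring step produces."""
    return restructure_at_half(p, neighbor_order).count(i)
```

For p = 10 and i = 12, neighbors of order 2i - p = 14, 2i - p + 1 = 15 and 2i - p + 2 = 16 produce one, two and one nodes of order 12 respectively, matching the three gain terms in the recurrence.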

We next have to generate the equations for the ghosts. Table 18 lists the possible gains and losses.
The entries are similar to the entries in the table of changes in A_i. The gain or loss caused by an
insert or delete is the number of ghosts in the node, which is the total number of ghosts in order i nodes
divided by the number of order i nodes, C_i/A_i. If a node of order 2i - p + 1 merges with an order p
node, then the number of ghosts contained in the order i nodes increases by the number of ghosts
contained in the order p node, plus the number of ghosts contained in the order 2i - p + 1 node, plus the
one ghost created. Assume that when two nodes of different sizes are created from a merge, the smaller
of the two is the node that received entries. Assume also that the ghosts in a node are uniformly
distributed between the items. Then, in a merge with an order 2i - p node, the resulting node of order i
receives i/(2i - p) of the ghosts of the order 2i - p node. In a merge with an order 2i - p + 2 node, the
resulting node of order i receives the ghosts of the order p node, (i - p + 1)/(2i - p + 2) of the ghosts of
the order 2i - p + 2 node, and the ghost that is created.

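gain/loss                                                        probability                                        range               cause
C_i(L)/A_i(L)                                                    [illegible]                                        [illegible]         [illegible]
(i/(2i-p)) C_{2i-p}(L)/A_{2i-p}(L)                               [illegible]                                        p < i < (3p-2)/2    merge
C_p(L)/A_p(L) + C_{2i-p+1}(L)/A_{2i-p+1}(L) + 1                  q [p A_p(L)/(L(1-2q))] [A_{2i-p+1}(L)/V(L)]        [illegible]         merge
C_p(L)/A_p(L) + ((i-p+1)/(2i-p+2)) C_{2i-p+2}(L)/A_{2i-p+2}(L) + 1
                                                                 q [p A_p(L)/(L(1-2q))] [A_{2i-p+2}(L)/V(L)]        p < i < (3p-2)/2    merge
[illegible] + 1                                                  q [p A_p(L)/(L(1-2q))] [illegible]/V(L)            i = 2p-1            merge

Table 19: Changes in values of C_i on an operation

Using Table 18 to list the expected loss and gain on a single operation, we can generate the recurrence
equation for C_i(L), p < i < (3p-2)/2:

\[
\begin{aligned}
C_i(L+1) = C_i(L) &- \frac{C_i(L)}{A_i(L)}\left( q\,\frac{i A_i(L)}{L(1-2q)}
  + (1-q)\,\frac{i A_i(L) + C_i(L)}{L(1-q)}
  + q\,\frac{p A_p(L)}{L(1-2q)}\,\frac{A_i(L)}{V(L)} \right) \\
&+ q\,\frac{(i+1) A_{i+1}(L)}{L(1-2q)}\left( \frac{C_{i+1}(L)}{A_{i+1}(L)} + 1 \right)
  + (1-q)\,\frac{(i-1) A_{i-1}(L) + C_{i-1}(L)}{L(1-q)}\,\frac{C_{i-1}(L)}{A_{i-1}(L)} \\
&+ \frac{i}{2i-p}\,\frac{C_{2i-p}(L)}{A_{2i-p}(L)}\; q\,\frac{p A_p(L)}{L(1-2q)}\,\frac{A_{2i-p}(L)}{V(L)} \\
&+ \left( \frac{C_p(L)}{A_p(L)} + \frac{C_{2i-p+1}(L)}{A_{2i-p+1}(L)} + 1 \right)
  q\,\frac{p A_p(L)}{L(1-2q)}\,\frac{A_{2i-p+1}(L)}{V(L)} \\
&+ \left( \frac{C_p(L)}{A_p(L)} + \frac{i-p+1}{2i-p+2}\,\frac{C_{2i-p+2}(L)}{A_{2i-p+2}(L)} + 1 \right)
  q\,\frac{p A_p(L)}{L(1-2q)}\,\frac{A_{2i-p+2}(L)}{V(L)}
\end{aligned}
\]

\[
\begin{aligned}
(L+1)\,q\,c_i(L+1) = {} & L q\,c_i(L) - \frac{q\,c_i(L)}{(1-2q)\,a_i(L)}\left( q i\,a_i(L) + q\,c_i(L)
  + i(1-2q)\,a_i(L) + q p\,a_p(L)\,\frac{a_i(L)}{v(L)} \right) \\
& + q(i+1)\,a_{i+1}(L)\left( \frac{q\,c_{i+1}(L)}{(1-2q)\,a_{i+1}(L)} + 1 \right)
  + \frac{q\,c_{i-1}(L)}{(1-2q)\,a_{i-1}(L)}\bigl( q\,c_{i-1}(L) + (i-1)(1-2q)\,a_{i-1}(L) \bigr) \\
& + \frac{i}{2i-p}\,\frac{q^2 p\,c_{2i-p}(L)\,a_p(L)}{(1-2q)\,v(L)} \\
& + \left( \frac{q^2 p\,c_p(L)\,a_{2i-p+1}(L)}{(1-2q)\,v(L)} + \frac{q^2 p\,c_{2i-p+1}(L)\,a_p(L)}{(1-2q)\,v(L)}
  + \frac{q p\,a_p(L)\,a_{2i-p+1}(L)}{v(L)} \right) \\
& + \left( \frac{q^2 p\,c_p(L)\,a_{2i-p+2}(L)}{(1-2q)\,v(L)}
  + \frac{i-p+1}{2i-p+2}\,\frac{q^2 p\,c_{2i-p+2}(L)\,a_p(L)}{(1-2q)\,v(L)}
  + \frac{q p\,a_p(L)\,a_{2i-p+2}(L)}{v(L)} \right)
\end{aligned}
\]

\[
\begin{aligned}
(L+1)\,q\,\Delta c_i(L) = {} & -c_i(L)\left( q + \frac{q^2 i}{1-2q} + \frac{q^2 c_i(L)}{(1-2q)\,a_i(L)} + iq \right)
  + \frac{q^2 (i+1)\,c_{i+1}(L)}{1-2q} \\
& + q(i+1)\,a_{i+1}(L) + c_{i-1}(L)\left( \frac{q^2 c_{i-1}(L)}{(1-2q)\,a_{i-1}(L)} + (i-1)q \right) \\
& + \frac{q^2 p}{(1-2q)\,v(L)} \Bigl[ \frac{i\,a_p(L)\,c_{2i-p}(L)}{2i-p}
  + \Bigl( c_p(L)\,a_{2i-p+1}(L) + c_{2i-p+1}(L)\,a_p(L) + \frac{1-2q}{q}\,a_p(L)\,a_{2i-p+1}(L) \Bigr) \\
& \qquad + \Bigl( c_p(L)\,a_{2i-p+2}(L) + \frac{(i-p+1)\,c_{2i-p+2}(L)\,a_p(L)}{2i-p+2}
  + \frac{1-2q}{q}\,a_p(L)\,a_{2i-p+2}(L) \Bigr) - a_p(L)\,c_i(L) \Bigr]
\end{aligned}
\]
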
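We can find the equilibrium point for p < i < (3p-2)/2 to get:

\[
\begin{aligned}
0 = {} & -c_i\left( q + \frac{q^2 i}{1-2q} + \frac{q^2 c_i}{(1-2q)\,a_i} + iq \right)
  + \frac{q^2 (i+1)\,c_{i+1}}{1-2q} \\
& + q(i+1)\,a_{i+1} + c_{i-1}\left( \frac{q^2 c_{i-1}}{(1-2q)\,a_{i-1}} + (i-1)q \right) \\
& + \frac{q^2 p}{(1-2q)\,v} \Bigl[ \frac{i\,a_p\,c_{2i-p}}{2i-p}
  + \Bigl( c_p\,a_{2i-p+1} + c_{2i-p+1}\,a_p + \frac{1-2q}{q}\,a_p\,a_{2i-p+1} \Bigr) \\
& \qquad + \Bigl( c_p\,a_{2i-p+2} + \frac{(i-p+1)\,c_{2i-p+2}\,a_p}{2i-p+2}
  + \frac{1-2q}{q}\,a_p\,a_{2i-p+2} \Bigr) - a_p\,c_i \Bigr]
\end{aligned}
\]
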
We  can  find  the  equilibrium  point  for  the  other  cases  in  a  similar  manner. 
        utilization                 probability of splitting (merging)
  p     analytical   simulation     analytical       simulation
  5     43.02%       45.31%         4.00 x 10^-?     1.96 x 10^-?
 10     39.34        39.48          1.00 x 10^-?     1.19 x 10^-?
 15     38.21        38.40          4.44 x 10^-?     8.33 x 10^-?
 20     37.66        39.77          2.50 x 10^-?     2.05 x 10^-?

Table 1: Merge-at-empty linear model and simulation for pure modifies: utilization and probabilities of
splitting and merging


        utilization                 probability of splitting (merging)
  p     analytical   simulation     analytical       simulation
  5     47.51%       45.31%         1.02 x 10^-?     1.96 x 10^-?
 10     42.67        39.48          2.54 x 10^-?     1.19 x 10^-?
 15     40.87        38.40          6.40 x 10^-?     8.33 x 10^-?
 20     39.93        39.77          1.60 x 10^-?     2.05 x 10^-?

Table 2: Merge-at-empty ghost model and simulation for pure modifies: utilization and probabilities of
splitting and merging


        utilization                 probability of splitting         probability of deleting
  p     analytical   simulation     analytical      simulation       analytical      simulation
  5     57.81        54.29          3.69 x 10^-2    4.11 x 10^-2     2.36 x 10^-3    5.31 x 10^-3
 10     62.38        58.34          1.53 x 10^-2    1.51 x 10^-2     1.86 x 10^-?    6.6 x 10^-?
 15     64.36        60.86          9.74 x 10^-3    9.34 x 10^-3     1.95 x 10^-?    0
 20     65.36        62.39          7.13 x 10^-3    6.82 x 10^-3     2.44 x 10^-?    0

Table 3: Merge-at-empty ghost model and simulation, q = .45 (10% more inserts than deletes)


        utilization                 probability of splitting         probability of deleting
  p     analytical   simulation     analytical      simulation       analytical      simulation
  5     54.88        50.31          2.63 x 10^-2    3.05 x 10^-2     3.86 x 10^-3    1.02 x 10^-2
 10     60.40        53.54          9.86 x 10^-3    9.56 x 10^-3     5.28 x 10^-?    1.05 x 10^-?
 15     63.36        57.17          6.16 x 10^-3    5.78 x 10^-3     5.40 x 10^-?    0
 20     64.82        58.94          4.47 x 10^-3    4.13 x 10^-3     9.80 x 10^-?    0

Table 4: Merge-at-empty ghost model and simulation, q = .47 (6% more inserts than deletes)


                       q as percentage
  p                 5       10      20      30      40      45      47
  5   analytical    71.12   70.35   68.06   63.91   56.06   50.14   47.41
  5   simulation    71.14   70.34   69.01   66.45   59.75   54.29   50.31
 10   analytical    70.10   69.74   68.64   66.54   60.38   52.70   48.02
 10   simulation    70.51   70.13   69.20   67.77   63.04   58.34   53.52
 20   analytical    69.68   69.51   68.97   67.90   64.70   58.77   53.12
 20   simulation    69.85   69.98   69.57   68.03   64.86   62.39   58.94
 30   analytical    69.55   69.43   69.08   68.37   66.26   62.07   57.13
 30   simulation    69.69   70.05   69.14   68.60   67.03   64.30
 40   analytical    69.49   69.40   69.14   68.61   67.03   63.87   59.83
 40   simulation    69.60   71.49   69.11   68.43   67.53   65.06
 70   analytical    69.41   69.36   69.21   68.92   68.01   66.21   63.82
100   analytical    69.38   69.34   69.24   69.04   68.41   67.15   65.48

Table 5: Merge-at-empty linear model and simulation for varying node sizes and percentages of deletes:
utilization


                 q as percentage
  p           5              10             20             30             40             45             47
  5   ana.    1.48 x 10^-1   1.40 x 10^-1   1.23 x 10^-1   1.00 x 10^-1   7.21 x 10^-2   5.62 x 10^-2   4.97 x 10^-2
  5   sim.    1.48 x 10^-1   1.39 x 10^-1   1.20 x 10^-1   9.26 x 10^-2   6.18 x 10^-2   4.11 x 10^-2   3.05 x 10^-2
 10   ana.    7.11 x 10^-2   6.71 x 10^-2   5.76 x 10^-2   4.53 x 10^-2   2.93 x 10^-2   1.97 x 10^-2   1.57 x 10^-2
 10   sim.    7.06 x 10^-2   6.68 x 10^-2   5.71 x 10^-2   4.46 x 10^-2   2.71 x 10^-2   1.51 x 10^-2   9.56 x 10^-3
 20   ana.    3.49 x 10^-2   3.28 x 10^-2   2.79 x 10^-2   2.16 x 10^-2   1.32 x 10^-2   7.99 x 10^-3   5.72 x 10^-3
 20   sim.    3.47 x 10^-2   3.24 x 10^-2   2.77 x 10^-2   2.13 x 10^-2   1.29 x 10^-2   6.82 x 10^-3   4.13 x 10^-3
 30   ana.    2.30 x 10^-2   2.17 x 10^-2   1.84 x 10^-2   1.42 x 10^-2   8.53 x 10^-3   4.97 x 10^-3   3.40 x 10^-3
 30   sim.    2.30 x 10^-2   2.12 x 10^-2   1.83 x 10^-2   1.39 x 10^-2   8.08 x 10^-3   4.33 x 10^-3
 40   ana.    1.73 x 10^-2   1.62 x 10^-2   1.37 x 10^-2   1.05 x 10^-2   6.29 x 10^-3   3.60 x 10^-3   2.39 x 10^-3
 40   sim.    1.78 x 10^-2   1.64 x 10^-2   1.37 x 10^-2   1.05 x 10^-2   5.92 x 10^-3   3.25 x 10^-3
 70   ana.    9.82 x 10^-3   9.22 x 10^-3   7.80 x 10^-3   5.97 x 10^-3   3.53 x 10^-3   1.98 x 10^-3   1.28 x 10^-3
100   ana.    6.86 x 10^-3   6.44 x 10^-3   5.44 x 10^-3   4.16 x 10^-3   2.45 x 10^-3   1.36 x 10^-3   8.69 x 10^-4

Table 6: Merge-at-empty linear model and simulation for varying node sizes and percentages of deletes:
probability of splitting


                 q as percentage
  p           5              10             20             30             40             45             47
  5   ana.    3.93 x 10^-?   7.75 x 10^-?   1.95 x 10^-?   1.63 x 10^-?   9.04 x 10^-3   1.95 x 10^-?   2.62 x 10^-?
  5   sim.    0              0              0              0              2.10 x 10^-3   5.31 x 10^-3   1.02 x 10^-?
 10   ana.    0              0              5.04 x 10^-?   6.27 x 10^-?   3.17 x 10^-?   1.89 x 10^-?   3.74 x 10^-?
 10   sim.    0              0              0              0              0              6.60 x 10^-?   1.05 x 10^-?
 20   ana.    0              0              0              3.44 x 10^-?   1.45 x 10^-?   6.74 x 10^-?   2.98 x 10^-?
 20   sim.    0              0              0              0              0              0              0
 30   ana.    0              0              0              0              1.15 x 10^-?   4.16 x 10^-?   4.12 x 10^-?
 30   sim.    0              0              0              0              0              0              0
 40   ana.    0              0              0              0              1.14 x 10^-?   3.21 x 10^-?   7.14 x 10^-?
 40   sim.    0              0              0              0              0              0              0
 70   ana.    0              0              0              0              0              2.65 x 10^-?   6.63 x 10^-?
100   ana.    0              0              0              0              0              0              9.07 x 10^-?

Table 7: Merge-at-empty linear model and simulation for varying node sizes and percentages of deletes:
probability of merging


               Utilization
  p            q = .1   q = .2   q = .3   q = .4   q = .45  q = .47  q = .5
 20   ana.     70.09    70.36    70.48    69.70    67.92    66.43    62.91
 20   sim.     70.09    70.31    70.34    70.24    69.24    68.33    66.07
 30   ana.     69.98    70.37    70.72    70.41    68.93    67.37    61.43
 30   sim.     70.41    70.10    70.54    69.98    69.85    68.98    64.47
 40   ana.     69.94    70.38    70.86    70.86    69.66    68.22    60.51
 40   sim.     70.05    70.49    70.49    70.63    69.59    69.54    63.30

Table 8: Comparison of analytical and simulation space utilization for merge-at-half B-trees


               Pr[split on insert]
  p            q = .1   q = .2   q = .3   q = .4   q = .45  q = .47  q = .5
 20   ana.     .0331    .0284    .0221    .0136    .00852   .00669   .00711
 20   sim.     .0340    .0321    .0279    .0201    .0144    .0132    .0118
 30   ana.     .0218    .0185    .0142    .00843   .00483   .00329   .00226
 30   sim.     .0225    .0211    .0178    .0128    .00584   .00491   .00521
 40   ana.     .0162    .0138    .0105    .00615   .00344   .00224   .00126
 40   sim.     .0170    .0153    .0133    .00939   .00443   .00331   .00226

Table 9: Comparison of analytical and simulation probability of splitting on an insert for merge-at-half
B-trees


               Pr[merge on delete]
  p            q = .1   q = .2   q = .3   q = .4   q = .45  q = .47  q = .5
 20   ana.     .0644    .0571    .0478    .0384    .0391    .0450    .0762
 20   sim.     .0691    .0645    .0580    .0494    .0446    .0483    .0572
 30   ana.     .0428    .0372    .0299    .0210    .0181    .0193    .0465
 30   sim.     .0452    .0432    .0379    .0305    .0193    .0163    .0330
 40   ana.     .0320    .0276    .0217    .0143    .0109    .0107    .0324
 40   sim.     .0318    .0320    .0283    .0215    .0122    .0110    .0193

Table 10: Comparison of analytical and simulation probability of merging on a delete for merge-at-half
B-trees


               Pr[restructure on operation]
  p            q = .1   q = .2   q = .3   q = .4   q = .45  q = .47  q = .5
 20   half     .0362    .0341    .0298    .0235    .0223    .0247    .0417
 20   empty    .0291    .0221    .0149    .00774   .00378   .00223   .00005
 30   half     .0261    .0224    .0189    .0135    .0108    .0108    .0244
 30   empty    .0190    .0146    .00973   .00513   .00283   .00180   .00000
 40   half     .0178    .0166    .0139    .00941   .00680   .00622   .0168
 40   empty    .0147    .0109    .00735   .00355   .00178   .00125   .00000

Table 11: Comparison of merge-at-half and merge-at-empty probability of restructuring in an operation
(q P_m + (1 - q) P_s)


                   simulation
  i    analysis    q = 0     q = .1    q = .2    q = .3    q = .4
 10    .0136       .0134     .0144     .0131     .0128     .0130
 11    .0113       .0104     .00972    .0106     .0105     .0110
 12    .00959      .00934    .00834    .00910    .00948    .00930
 13    .00822      .00767    .00739    .00851    .00787    .00853
 14    .00712      .00723    .00691    .00707    .00754    .00630
 15    .00623      .00622    .00606    .00692    .00633    .00649
 16    .00550      .00576    .00655    .00560    .00584    .00535
 17    .00489      .00550    .00509    .00465    .00553    .00533
 18    .00437      .00434    .00490    .00484    .00434    .00440
 19    .00393      .00432    .00454    .00388    .00388    .00448

Table 12: Comparison of analytical vs. simulation values of a_{2,i} for various q (distribution of second
level nodes, bottom-up algorithm)


       q = 0           q = .1          q = .2          q = .3          q = .4
  i    ana     sim     ana     sim     ana     sim     ana     sim     ana     sim
  9    .0078   .0071   .0078   .0068   .0078   .0069   .0079   .0072   .0079   .0070
 10    .0136   .0118   .0136   .0121   .0136   .0135   .0136   .0134   .0136   .0127
 11    .0113   .0116   .0113   .0119   .0113   .0108   .0113   .0109   .0114   .0112
 12    .0096   .0097   .0096   .0092   .0096   .0091   .0096   .0097   .0096   .0094
 13    .0082   .0083   .0082   .0084   .0082   .0085   .0082   .0080   .0082   .0089
 14    .0071   .0074   .0071   .0070   .0071   .0072   .0071   .0073   .0071   .0070
 15    .0062   .0063   .0062   .0070   .0062   .0061   .0063   .0064   .0062   .0064
 16    .0055   .0055   .0055   .0057   .0055   .0060   .0055   .0055   .0055   .0059
 17    .0049   .0050   .0049   .0053   .0049   .0052   .0049   .0052   .0049   .0047
 18    .0044   .0049   .0044   .0043   .0044   .0046   .0044   .0045   .0044   .0047
 19    .0003   .0002   .0003   .0002   .0002   .0003   .0002   .0002   .0001   .0002

Table 13: Comparison of analytical vs. simulation values of a_{2,i} for various q (distribution of second
level nodes, top-down algorithm)


       utilization
  h    q = 0     q = .2    q = .4
  2    68.473    68.470    68.465
  3    68.460    68.460    68.460
  4    68.460    68.460    68.460

Table 14: Space utilization of various levels of a top-down tree, p = 30


  h    expected       variance       std. dev.
  1    13.4           7.73           2.78
  2    179            1486           38.5
  3    2394           2.67 x 10^5    517
  4    32019          4.79 x 10^7    6919
  5    4.28 x 10^5    8.56 x 10^9    92543

Table 15: Variance of distribution of number of items covered in a bottom-up B-tree, p = 10
[Table body only partially recoverable. Surviving entries, in source order: 8, 5; 5, 10, 291047, 5, 11; 5, 14; 5, 12]

Table 16: Comparison of predicted and simulated root height and order for various Er(h, i)